IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer
Abstract
Compiler optimization is essential for improving program performance, yet modern compilers still depend on manually crafted transformation rules over intermediate representations (IRs). As compilers grow in complexity, maintaining these rule-based optimizations becomes increasingly labor-intensive and difficult to scale. Recent advances in large language models (LLMs) offer a promising alternative, but their effectiveness in compiler optimization remains limited, primarily because there are few IR-oriented datasets that expose models to diverse transformation samples in real-world scenarios (*optimization-sensitive samples*). This gap hinders LLMs from learning rich and generalizable optimization strategies. In this paper, we introduce IR-OptSet, the first public optimization-sensitive dataset for advancing LLM-based IR optimizers. It comprises 170K LLVM IR samples from open-source repositories across 8 representative optimization domains. IR-OptSet defines two core tasks, Code Analysis and Optimized Code Generation, and provides tools for correctness verification, performance evaluation, and dataset expansion. In our experiments, fine-tuning three representative LLMs on IR-OptSet leads to significant accuracy improvements across both tasks. Moreover, the LLM fine-tuned with IR-OptSet *outperforms a traditional compiler with the -O3 option* in 64 test cases in terms of performance. Further analysis reveals that IR-OptSet provides greater transformation diversity and representativeness than three widely used IR-oriented datasets, highlighting its potential to drive model-based IR optimization. IR-OptSet is publicly available at [https://huggingface.co/datasets/YangziResearch/IR-OptSet](https://huggingface.co/datasets/YangziResearch/IR-OptSet).
Cite
Text
Yang et al. "IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer." Advances in Neural Information Processing Systems, 2025.
Markdown
[Yang et al. "IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yang2025neurips-iroptset/)
BibTeX
@inproceedings{yang2025neurips-iroptset,
title = {{IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer}},
author = {Yang, Zi and Qiu, Lei and Lyu, Fang and Zhong, Ming and Chai, Zhilei and Zhou, Haojie and Cui, Huimin and Feng, Xiaobing},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/yang2025neurips-iroptset/}
}