ComPile: A Large IR Dataset from Production Sources

Grossman, Aiden; Paehler, Ludger; Parasyris, Konstantinos; Ben-Nun, Tal; Hegna, Jacob; Moses, William S.; Diaz, Jose M Monsalve; Trofin, Mircea; Doerfert, Johannes

DMLR 2024 pp. 1-33

Abstract

Code is increasingly becoming a core data modality of modern machine learning research, impacting not only the way we write code with conversational agents like OpenAI’s ChatGPT, Google’s Bard, or Anthropic’s Claude, the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Yet, relying solely on the ability of modern models to extract information from unstructured code does not take advantage of 70 years of programming language and compiler development by not utilizing the structure inherent to programs in the data collection. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a T Llama 2 token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++, by hooking into LLVM code generation either through the language’s package manager or the compiler directly to extract the dataset of intermediate representations from production grade programs. Statistical analysis proves the utility of our dataset not only for large language model training, but also for the introspection into the code generation process itself as well as for training of machine-learned compiler components.

Cite

Text

Grossman et al. "ComPile: A Large IR Dataset from Production Sources." Data-centric Machine Learning Research, 2024.

Markdown

[Grossman et al. "ComPile: A Large IR Dataset from Production Sources." Data-centric Machine Learning Research, 2024.](https://mlanthology.org/dmlr/2024/grossman2024dmlr-compile/)

BibTeX

@article{grossman2024dmlr-compile,
  title     = {{ComPile: A Large IR Dataset from Production Sources}},
  author    = {Grossman, Aiden and Paehler, Ludger and Parasyris, Konstantinos and Ben-Nun, Tal and Hegna, Jacob and Moses, William S. and Diaz, Jose M Monsalve and Trofin, Mircea and Doerfert, Johannes},
  journal   = {Data-centric Machine Learning Research},
  year      = {2024},
  pages     = {1-33},
  volume    = {1},
  url       = {https://mlanthology.org/dmlr/2024/grossman2024dmlr-compile/}
}