ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Abstract

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement genuinely novel ideas from recent research papers (ideas unseen during pretraining) remains unclear. We introduce ResearchCodeBench, a benchmark that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We evaluate 30+ proprietary and open-source LLMs and find that even the best models correctly implement less than 40% of the code. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.
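
For concreteness, the Python sketch below illustrates the general shape of the evaluation the abstract describes: a model is asked to fill in a held-out implementation from a recent paper's repository, and correctness is scored by whether the completed code passes its tests. This is an illustrative assumption about such a harness, not the paper's actual pipeline; the Task structure, the query_llm stub, and the "# <MASKED>" convention are hypothetical placeholders.

import subprocess
import tempfile
from dataclasses import dataclass

@dataclass
class Task:
    paper_id: str   # which 2024-2025 paper the snippet comes from (hypothetical field)
    context: str    # repository code with the target region replaced by "# <MASKED>"
    test_code: str  # unit tests the completed snippet must pass

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a proprietary or open-source LLM."""
    raise NotImplementedError

def evaluate(tasks: list[Task]) -> float:
    """Return the fraction of tasks whose generated code passes its tests."""
    passed = 0
    for task in tasks:
        completion = query_llm(
            "Fill in the masked implementation so the surrounding code runs:\n"
            + task.context
        )
        # Write the completed module plus its tests to a temp file and execute it;
        # a zero exit code counts as a correct implementation.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(task.context.replace("# <MASKED>", completion))
            f.write("\n\n" + task.test_code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=300)
            passed += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # a hung run counts as a failure
    return passed / len(tasks)

A harness like this yields a single correctness fraction per model, which is the kind of number behind the "less than 40%" figure reported in the abstract; the paper's own grading and test setup may differ.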

Cite

Text

Hua et al. "ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code." Advances in Neural Information Processing Systems, 2025.

Markdown

[Hua et al. "ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/hua2025neurips-researchcodebench/)

BibTeX

@inproceedings{hua2025neurips-researchcodebench,
  title     = {{ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code}},
  author    = {Hua, Tianyu and Hua, Harper and Xiang, Violet and Klieger, Benjamin and Truong, Sang T. and Liang, Weixin and Sun, Fan-Yun and Haber, Nick},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/hua2025neurips-researchcodebench/}
}