MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
Abstract
We introduce **MLRC-Bench**, a benchmark designed to quantify how effectively language agents can tackle challenging **M**achine **L**earning (ML) **R**esearch **C**ompetitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist, which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between *LLM-judged* innovation and agents' *actual* performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark designed to grow continually with new ML competitions, encouraging rigorous and objective evaluation of AI’s research capabilities. Our leaderboard and code are publicly available at https://huggingface.co/spaces/launch/MLRC_Bench.
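The "gap closed" figure quoted in the abstract can be read as a normalized improvement over the baseline. The sketch below is one plausible reading, not the benchmark's own code: it assumes the figure is the agent's gain over the baseline divided by the top human's gain over the baseline, expressed as a percentage. The function name `gap_closed` and the example scores are hypothetical.

```python
def gap_closed(agent: float, baseline: float, top_human: float) -> float:
    """Percentage of the baseline-to-top-human gap closed by the agent.

    Assumed formula (not confirmed by the abstract):
        100 * (agent - baseline) / (top_human - baseline)
    """
    gap = top_human - baseline
    if gap == 0:
        raise ValueError("baseline and top human scores are equal; gap is undefined")
    return 100.0 * (agent - baseline) / gap


# Hypothetical scores, chosen only for illustration (not taken from the benchmark).
example = gap_closed(agent=52.0, baseline=50.0, top_human=70.0)
print(f"{example:.1f}% of the gap closed")  # prints "10.0% of the gap closed"
```

Under this reading, 0% means the agent only matches the baseline, 100% means it matches the top human participant, and a negative value means it falls below the baseline. Because the numerator and denominator flip sign together, the same expression also works for metrics where lower scores are better.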
Cite
Text
Zhang et al. "MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhang et al. "MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-mlrcbench/)
BibTeX
@inproceedings{zhang2025neurips-mlrcbench,
title = {{MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?}},
author = {Zhang, Yunxiang and Khalifa, Muhammad and Bhushan, Shitanshu and Murphy, Grant D and Logeswaran, Lajanugen and Kim, Jaekyeom and Lee, Moontae and Lee, Honglak and Wang, Lu},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/zhang2025neurips-mlrcbench/}
}