RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts

Wijk, Hjalmar; Lin, Tao Roa; Becker, Joel; Jawhar, Sami; Parikh, Neev; Broadley, Thomas; Chan, Lawrence; Chen, Michael; Clymer, Joshua M; Dhyani, Jai; Ericheva, Elena; Garcia, Katharyn; Goodrich, Brian; Jurkovic, Nikola; Kinniment, Megan; Lajko, Aron; Nix, Seraphina; Koba Sato, Lucas Jun; Saunders, William; Taran, Maksym; West, Ben; Barnes, Elizabeth

ICML 2025 pp. 66772-66832

Abstract

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-$k$ with varying time budgets and agent designs, and find that the best AI agents achieve a score 4$\times$ higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2$\times$ the score of the top AI agent when both are given 32 total hours (across different attempts).
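The best-of-$k$ comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes a fixed total time budget is split into $k$ independent runs of equal length and that the best score among them is reported; the expected best score is then estimated from a sample of observed run scores by treating the sample as the empirical distribution. The function names and the example scores below are hypothetical, and the abstract does not state that the paper uses this exact estimator.

```python
def runs_in_budget(total_hours, hours_per_run):
    """Number of independent runs k that fit in a total time budget (assumed equal-length runs)."""
    if hours_per_run <= 0:
        raise ValueError("hours_per_run must be positive")
    return int(total_hours // hours_per_run)


def expected_best_of_k(scores, k):
    """Expected maximum of k draws (with replacement) from the empirical score sample.

    With scores sorted ascending as s_1 <= ... <= s_n, the i-th value is the
    maximum with probability (i/n)^k - ((i-1)/n)^k.
    """
    if k < 1 or not scores:
        raise ValueError("need k >= 1 and at least one score")
    ordered = sorted(scores)
    n = len(ordered)
    return sum(
        s * ((i / n) ** k - ((i - 1) / n) ** k)
        for i, s in enumerate(ordered, start=1)
    )


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    hypothetical_scores = [0.0, 0.1, 0.3, 0.3, 0.6]
    k = runs_in_budget(total_hours=32, hours_per_run=2)  # 16 runs of 2 hours
    print(k, expected_best_of_k(hypothetical_scores, k))
```

Under this framing, a budget of 32 total hours could be spent as one long attempt or as many shorter ones, which is why the abstract qualifies the 32-hour comparison as being "across different attempts".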

Cite

Text

Wijk et al. "RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Wijk et al. "RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wijk2025icml-rebench/)

BibTeX

@inproceedings{wijk2025icml-rebench,
  title     = {{RE-Bench: Evaluating Frontier AI R\&D Capabilities of Language Model Agents Against Human Experts}},
  author    = {Wijk, Hjalmar and Lin, Tao Roa and Becker, Joel and Jawhar, Sami and Parikh, Neev and Broadley, Thomas and Chan, Lawrence and Chen, Michael and Clymer, Joshua M and Dhyani, Jai and Ericheva, Elena and Garcia, Katharyn and Goodrich, Brian and Jurkovic, Nikola and Kinniment, Megan and Lajko, Aron and Nix, Seraphina and Koba Sato, Lucas Jun and Saunders, William and Taran, Maksym and West, Ben and Barnes, Elizabeth},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {66772--66832},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/wijk2025icml-rebench/}
}