PaperBench: Evaluating AI’s Ability to Replicate AI Research
Abstract
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
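The abstract describes rubrics that hierarchically decompose each replication task into weighted, individually gradable sub-tasks whose scores aggregate into a single replication score. The sketch below illustrates that general idea; the node structure, weights, and aggregation rule are illustrative assumptions, not the paper's exact specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    """One node in a hierarchical replication rubric (illustrative)."""
    name: str
    weight: float = 1.0            # relative weight among siblings (assumed scheme)
    grade: Optional[float] = None  # leaf grade in [0, 1]; None for internal nodes
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        """Leaves return their grade; internal nodes return the
        weight-normalized average of their children's scores."""
        if not self.children:
            return self.grade or 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight

# Hypothetical two-level rubric for a single paper replication attempt.
root = RubricNode("Replicate paper", children=[
    RubricNode("Code development", weight=2.0, children=[
        RubricNode("Implement model", grade=1.0),
        RubricNode("Implement training loop", grade=0.5),
    ]),
    RubricNode("Experiment execution", weight=1.0, children=[
        RubricNode("Training run completes", grade=0.0),
    ]),
])

print(root.score())  # → 0.5
```

Here "Code development" scores (1.0 + 0.5) / 2 = 0.75, "Experiment execution" scores 0.0, and the root combines them as (2 × 0.75 + 1 × 0.0) / 3 = 0.5; in the benchmark, an LLM judge would supply the leaf grades.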
Cite
Text
Starace et al. "PaperBench: Evaluating AI’s Ability to Replicate AI Research." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown
[Starace et al. "PaperBench: Evaluating AI’s Ability to Replicate AI Research." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/starace2025icml-paperbench/)

BibTeX
@inproceedings{starace2025icml-paperbench,
title = {{PaperBench: Evaluating AI’s Ability to Replicate AI Research}},
author = {Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and Heidecke, Johannes and Glaese, Amelia and Patwardhan, Tejal},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {56843-56873},
volume = {267},
url = {https://mlanthology.org/icml/2025/starace2025icml-paperbench/}
}