The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
Abstract
Rapidly improving large language models (LLMs) have the potential to assist in scientific progress. One critical skill in this endeavor is the ability to faithfully reproduce existing work. To evaluate the capability of AI agents to reproduce complex code in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent frontier reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
Cite
Text
Zhao et al. "The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhao et al. "The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhao2025neurips-automated/)
BibTeX
@inproceedings{zhao2025neurips-automated,
title = {{The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements}},
author = {Zhao, Bingchen and Magka, Despoina and Jiang, Minqi and Li, Xian and Raileanu, Roberta and Shavrina, Tatiana and Gagnon-Audet, Jean-Christophe and Niu, Kelvin and Sodhani, Shagun and Shvartsman, Michael and Lupu, Andrei and Lupidi, Alisia Maria and Hambardzumyan, Karen and Josifoski, Martin and Toledo, Edan and Foster, Thomas and Cipolina-Kun, Lucia and Dunfield, Derek and Charnalia, Abhishek and Miller, Alexander H and Aodha, Oisin Mac and Foerster, Jakob Nicolaus and Bachrach, Yoram},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/zhao2025neurips-automated/}
}