EXP-Bench: Can AI Conduct AI Research Experiments?

Kon, Patrick Tser Jern; Ding, Qiuyi; Liu, Jiachen; Zhu, Xinyi; Peng, Jingjia; Xing, Jiarong; Huang, Yibo; Qiu, Yiming; Srinivasa, Jayanth; Lee, Myungjin; Chowdhury, Mosharaf; Zaharia, Matei; Chen, Ang

ICLR 2026

Abstract

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze the results. To create such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline that extracts and structures crucial experimental details from these research papers and their associated open-source code. Using this pipeline, we curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading AI agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for improving the ability of future AI agents to conduct AI research experiments.

Cite

Text

Kon et al. "EXP-Bench: Can AI Conduct AI Research Experiments?" International Conference on Learning Representations, 2026.

Markdown

[Kon et al. "EXP-Bench: Can AI Conduct AI Research Experiments?" International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kon2026iclr-expbench/)

BibTeX

@inproceedings{kon2026iclr-expbench,
  title     = {{EXP-Bench: Can AI Conduct AI Research Experiments?}},
  author    = {Kon, Patrick Tser Jern and Ding, Qiuyi and Liu, Jiachen and Zhu, Xinyi and Peng, Jingjia and Xing, Jiarong and Huang, Yibo and Qiu, Yiming and Srinivasa, Jayanth and Lee, Myungjin and Chowdhury, Mosharaf and Zaharia, Matei and Chen, Ang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kon2026iclr-expbench/}
}