AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Abstract

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

Cite

Text

Bragg et al. "AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite." International Conference on Learning Representations, 2026.

Markdown

[Bragg et al. "AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/bragg2026iclr-astabench/)

BibTeX

@inproceedings{bragg2026iclr-astabench,
  title     = {{AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite}},
  author    = {Bragg, Jonathan and D'Arcy, Mike and Balepur, Nishant and Bareket, Dan and Mishra, Bhavana Dalvi and Feldman, Sergey and Haddad, Dany and Hwang, Jena D. and Jansen, Peter and Kishore, Varsha and Majumder, Bodhisattwa Prasad and Naik, Aakanksha and Rahamimov, Sigal and Richardson, Kyle and Singh, Amanpreet and Surana, Harshit and Tiktinsky, Aryeh and Vasu, Rosni and Wiener, Guy and Anastasiades, Chloe and Candra, Stefanus and Dunkelberger, Jason and Emery, Daniel and Evans, Rob and Hamada, Malachi and Huff, Regan and Kinney, Rodney and Latzke, Matt and Lochner, Jaron and Lozano-Aguilera, Ruben and Nguyen, Ngoc-Uyen and Rao, Smita and Tanaka, Amber and Vlahos, Brooke and Clark, Peter and Downey, Doug and Goldberg, Yoav and Sabharwal, Ashish and Weld, Daniel S},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/bragg2026iclr-astabench/}
}