Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Abstract

AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work (Figure 1). We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct a three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service, at a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.

Cite

Text

Kapoor et al. "Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation." International Conference on Learning Representations, 2026.

Markdown

[Kapoor et al. "Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kapoor2026iclr-holistic/)

BibTeX

@inproceedings{kapoor2026iclr-holistic,
  title     = {{Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation}},
  author    = {Kapoor, Sayash and Stroebl, Benedikt and Kirgis, Peter and Nadgir, Nitya and Siegel, Zachary S and Wei, Boyi and Xue, Tianci and Chen, Ziru and Chen, Felix and Utpala, Saiteja and Ndzomga, Franck and Oruganty, Dheeraj and Luskin, Sophie and Liu, Kangheng and Yu, Botao and Arora, Amit and Hahm, Dongyoon and Trivedi, Harsh and Sun, Huan and Lee, Juyong and Jin, Tengjun and Mai, Yifan and Zhou, Yifei and Zhu, Yuxuan and Bommasani, Rishi and Kang, Daniel and Song, Dawn and Henderson, Peter and Su, Yu and Liang, Percy and Narayanan, Arvind},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kapoor2026iclr-holistic/}
}