Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Abstract

Agentic search such as Deep Research systems, where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

Cite

Text

Gou et al. "Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge." Advances in Neural Information Processing Systems, 2025.

Markdown

[Gou et al. "Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/gou2025neurips-mind2web/)

BibTeX

@inproceedings{gou2025neurips-mind2web,
  title     = {{Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge}},
  author    = {Gou, Boyu and Huang, Zanming and Ning, Yuting and Gu, Yu and Lin, Michael and Qi, Weijian and Kopanev, Andrei and Yu, Botao and Gutierrez, Bernal Jimenez and Shu, Yiheng and Song, Chan Hee and Wu, Jiaman and Chen, Shijie and Moussa, Hanane Nour and Zhang, Tianshu and Xie, Jian and Li, Yifei and Xue, Tianci and Liao, Zeyi and Zhang, Kai and Zheng, Boyuan and Cai, Zhaowei and Rozgic, Viktor and Ziyadi, Morteza and Sun, Huan and Su, Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/gou2025neurips-mind2web/}
}