Agent-as-a-Judge: Evaluate Agents with Agents

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of the thinking done by agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is a natural extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback throughout the entire task-solving process for more precise evaluations. We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, such as a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that this work represents a concrete step towards enabling vastly more sophisticated agentic systems. To support this, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge.

Cite

Text

Zhuge et al. "Agent-as-a-Judge: Evaluate Agents with Agents." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Zhuge et al. "Agent-as-a-Judge: Evaluate Agents with Agents." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhuge2025icml-agentasajudge/)

BibTeX

@inproceedings{zhuge2025icml-agentasajudge,
  title     = {{Agent-as-a-Judge: Evaluate Agents with Agents}},
  author    = {Zhuge, Mingchen and Zhao, Changsheng and Ashley, Dylan R. and Wang, Wenyi and Khizbullin, Dmitrii and Xiong, Yunyang and Liu, Zechun and Chang, Ernie and Krishnamoorthi, Raghuraman and Tian, Yuandong and Shi, Yangyang and Chandra, Vikas and Schmidhuber, Jürgen},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {80569--80611},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/zhuge2025icml-agentasajudge/}
}