BenchAgents: Automated Benchmark Creation with Agent Interaction
Abstract
Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce $\texttt{BenchAgents}$, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. $\texttt{BenchAgents}$ decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use $\texttt{BenchAgents}$ to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
Cite
Text
Butt et al. "BenchAgents: Automated Benchmark Creation with Agent Interaction." ICLR 2025 Workshops: Data_Problems, 2025.
Markdown
[Butt et al. "BenchAgents: Automated Benchmark Creation with Agent Interaction." ICLR 2025 Workshops: Data_Problems, 2025.](https://mlanthology.org/iclrw/2025/butt2025iclrw-benchagents/)
BibTeX
@inproceedings{butt2025iclrw-benchagents,
title = {{BenchAgents: Automated Benchmark Creation with Agent Interaction}},
author = {Butt, Natasha and Chandrasekaran, Varun and Joshi, Neel and Nushi, Besmira and Balachandran, Vidhisha},
booktitle = {ICLR 2025 Workshops: Data_Problems},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/butt2025iclrw-benchagents/}
}