Lifting the Benchmark Iceberg with Item-Response Theory

Abstract

The evaluation of large language models (LLMs) through benchmarks has become a cornerstone of AI development, guiding critical decisions about model deployment and research directions. However, as benchmarks evolve from narrow task-specific assessments to broad capability evaluations, they become more difficult to develop, understand, and analyze. Here, we report a "benchmark iceberg" phenomenon, in which much of the variability in model rankings stems not from true capability differences but from hidden implementation choices beneath the surface of reported scores. Our analysis demonstrates how minor changes to these implementation details can alter model rankings, a concerning finding given benchmarks' role in shaping the AI landscape. To address this, we leverage psychometric principles from educational testing. By adapting item response theory (IRT), we transform benchmarks from opaque leaderboards into transparent measurement instruments, revealing how hidden implementation choices currently distort our perception of model capabilities.
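The abstract does not describe the authors' implementation, but the standard two-parameter logistic (2PL) IRT model conveys the core idea: each evaluated model gets a latent ability and each benchmark item gets a difficulty and a discrimination, so rankings are derived from parameters rather than raw accuracy. The sketch below is illustrative only; the data layout (models as rows, benchmark items as columns, binary correctness), the joint maximum-likelihood fit, and the use of scipy are assumptions, not details from the paper.

```python
# Minimal, illustrative sketch of the 2PL IRT model (not the authors' code).
# Rows of the response matrix are LLMs ("examinees"), columns are benchmark items.
import numpy as np
from scipy.optimize import minimize

def irt_2pl_prob(ability, difficulty, discrimination):
    """P(correct) under the 2PL model: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

def negative_log_likelihood(params, responses):
    """Joint negative log-likelihood over all model-item pairs (0/1 responses)."""
    n_models, n_items = responses.shape
    theta = params[:n_models]                      # latent model abilities
    b = params[n_models:n_models + n_items]        # item difficulties
    a = np.exp(params[n_models + n_items:])        # discriminations, kept positive
    p = irt_2pl_prob(theta[:, None], b[None, :], a[None, :])
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

# Toy data: 5 LLMs scored on 8 benchmark items (1 = correct, 0 = incorrect).
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(5, 8)).astype(float)

# Joint maximum likelihood is used here only for brevity; real IRT toolkits
# typically use marginal ML or Bayesian estimation.
x0 = np.zeros(5 + 8 + 8)
fit = minimize(negative_log_likelihood, x0, args=(responses,), method="L-BFGS-B")
abilities = fit.x[:5]
print("Estimated model abilities (latent capability scale):", abilities)
```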

Cite

Text

Schilling-Wilhelmi et al. "Lifting the Benchmark Iceberg with Item-Response Theory." ICLR 2025 Workshops: AI4MAT, 2025.


BibTeX

@inproceedings{schillingwilhelmi2025iclrw-lifting,
  title     = {{Lifting the Benchmark Iceberg with Item-Response Theory}},
  author    = {Schilling-Wilhelmi, Mara and Alampara, Nawaf and Jablonka, Kevin Maik},
  booktitle = {ICLR 2025 Workshops: AI4MAT},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/schillingwilhelmi2025iclrw-lifting/}
}