Model Evaluations Need Rigorous and Transparent Human Baselines
Abstract
**This position paper argues that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance.** Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework for assessing human baselining methods. We then use our framework to systematically review 113 human baselines (studies) in foundation model evaluations, identifying shortcomings in existing baselining methods. We publish our framework as a reporting checklist for researchers conducting human baseline studies. We hope our work advances more rigorous AI evaluation practices that better serve both the research community and policymakers.
Cite
Text
Wei et al. "Model Evaluations Need Rigorous and Transparent Human Baselines." ICLR 2025 Workshops: BuildingTrust, 2025.
Markdown
[Wei et al. "Model Evaluations Need Rigorous and Transparent Human Baselines." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/wei2025iclrw-model/)
BibTeX
@inproceedings{wei2025iclrw-model,
  title = {{Model Evaluations Need Rigorous and Transparent Human Baselines}},
  author = {Wei, Kevin and Paskov, Patricia and Dev, Sunishchal and Byun, Michael J and Reuel, Anka and Roberts-Gaal, Xavier and Calcott, Rachel and Coxon, Evie and Deshpande, Chinmay},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year = {2025},
  url = {https://mlanthology.org/iclrw/2025/wei2025iclrw-model/}
}