Large Language Model Benchmarks Do Not Test Reliability
Abstract
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also *reliable*. Many benchmarks have been created to track LLMs’ growing capabilities. However, there has been no similar focus on measuring their reliability. To understand this landscape, we first investigate how well current benchmarks quantify model reliability. We find that pervasive label errors compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we propose the construction of so-called platinum benchmarks that are carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures reveals previously unidentified patterns of questions on which frontier models consistently struggle.
Cite
Text
Vendrow et al. "Large Language Model Benchmarks Do Not Test Reliability." NeurIPS 2024 Workshops: SafeGenAi, 2024.
Markdown
[Vendrow et al. "Large Language Model Benchmarks Do Not Test Reliability." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/vendrow2024neuripsw-large/)
BibTeX
@inproceedings{vendrow2024neuripsw-large,
  title     = {{Large Language Model Benchmarks Do Not Test Reliability}},
  author    = {Vendrow, Joshua and Vendrow, Edward and Beery, Sara and Madry, Aleksander},
  booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/vendrow2024neuripsw-large/}
}