Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

Abstract

In this paper we explore the evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as an input parameter some element drawn from a large, well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc.). We examine several conditions per task and perform enough trials that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy to both query phrasing and input parameter population. We find that seemingly trivial modifications to the task prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing to be counted) and object frequency (e.g., success when an element accounts for ≈50% of a list differs from when it accounts for ≈70%). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions formed from interactions with humans are a very unreliable guide as to which input modifications should "make no difference" to LLM performance.
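To make the experimental setup concrete, the sketch below shows one way such a list-counting condition and a significance comparison between two conditions might look. It is a minimal illustration under assumptions, not the authors' actual harness: the prompt wording, the `query_llm` stub, and all parameter names are hypothetical, and the two-proportion z-test is one standard choice for detecting accuracy differences between conditions.

```python
import random
from statistics import NormalDist

def make_counting_prompt(list_length, target_fraction, target="apple", filler="orange"):
    """Build a list where `target` makes up roughly `target_fraction` of the entries."""
    n_target = round(list_length * target_fraction)
    items = [target] * n_target + [filler] * (list_length - n_target)
    random.shuffle(items)
    prompt = f"How many times does the word '{target}' appear in this list? {', '.join(items)}"
    return prompt, n_target

def query_llm(prompt):
    """Placeholder for a call to the model under test (e.g., GPT-4)."""
    raise NotImplementedError

def run_condition(n_trials, list_length, target_fraction):
    """Return the number of exactly correct answers over n_trials."""
    correct = 0
    for _ in range(n_trials):
        prompt, truth = make_counting_prompt(list_length, target_fraction)
        answer = query_llm(prompt)
        correct += int(str(truth) == str(answer).strip())
    return correct

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided p-value for the difference between two accuracy rates."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (success_a / n_a - success_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Example comparison: does accuracy differ when the counted element is
# ~50% vs ~70% of a 40-item list, over 500 trials per condition?
# c50 = run_condition(500, 40, 0.5)
# c70 = run_condition(500, 40, 0.7)
# print(two_proportion_z(c50, 500, c70, 500))
```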

Cite

Text

Ball et al. "Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities." Transactions on Machine Learning Research, 2024.

Markdown

[Ball et al. "Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/ball2024tmlr-we/)

BibTeX

@article{ball2024tmlr-we,
  title     = {{Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities}},
  author    = {Ball, Thomas and Chen, Shuo and Herley, Cormac},
  journal   = {Transactions on Machine Learning Research},
  year      = {2024},
  url       = {https://mlanthology.org/tmlr/2024/ball2024tmlr-we/}
}