Choices Speak Louder than Questions

Abstract

Recent findings raise concerns about whether the evaluation of Multiple-Choice Question Answering (MCQA) accurately reflects the comprehension abilities of large language models. This paper explores the concept of *choice sensitivity*, which refers to the tendency for model decisions to be more influenced by the answer options than by a genuine understanding of the question. We introduce a new scoring method called **Normalized Probability Shift by the Question (NPSQ)**, designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbols, and hybrid formats, we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when modifications are made to the answer options.
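To make the two traditional scoring rules named above concrete, the following is a minimal sketch of log-likelihood scoring and its length-normalized variant. It does not implement NPSQ, whose definition is not given in this abstract. The function names and the per-token log-probabilities are hypothetical, chosen only to show how a surface property of the choices (here, token count) can decide the prediction.

```python
# Illustrative sketch only: the numbers below are made up, not model outputs.

def log_likelihood(token_logprobs):
    """Score a choice by the sum of its per-token log-probabilities."""
    return sum(token_logprobs)

def length_normalized(token_logprobs):
    """Score a choice by its average per-token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)

def pick(choices, scorer):
    """Return the name of the choice with the highest score."""
    return max(choices, key=lambda name: scorer(choices[name]))

# Hypothetical per-token log-probabilities for two answer choices.
choices = {
    "short": [-2.0, -2.0],                     # 2 tokens: sum -4.0, mean -2.0
    "long": [-1.0, -1.0, -1.0, -1.0, -1.0],    # 5 tokens: sum -5.0, mean -1.0
}

by_sum = pick(choices, log_likelihood)      # the shorter choice has the higher sum
by_mean = pick(choices, length_normalized)  # the longer choice has the higher mean
print(by_sum, by_mean)
```

With these assumed values the two rules disagree: the raw sum selects the shorter choice and the per-token average selects the longer one, even though nothing about the question has changed. This is the kind of dependence on the answer options that the abstract describes.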

Cite

Text

Cho et al. "Choices Speak Louder than Questions." International Conference on Learning Representations, 2026.

Markdown

[Cho et al. "Choices Speak Louder than Questions." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/cho2026iclr-choices/)

BibTeX

@inproceedings{cho2026iclr-choices,
  title     = {{Choices Speak Louder than Questions}},
  author    = {Cho, Gyeongje and So, Yeonkyoung and Lee, Jaejin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/cho2026iclr-choices/}
}