Calibrating Large Language Models with Sample Consistency

Abstract

Accurately gauging the confidence of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often inherently uncalibrated and elude conventional calibration techniques because of their proprietary nature and massive scale. In this work, we derive model confidence from the distribution of multiple randomly sampled generations, using three measures of consistency. We extensively evaluate eleven open- and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches in terms of calibration error. We also find that intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency can potentially enhance model performance. Finally, we offer guidance on choosing suitable consistency metrics for calibration, tailored to model characteristics such as exposure to instruction-tuning and RLHF.
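To make the idea of deriving confidence from sampled generations concrete, here is a minimal Python sketch. The abstract does not name its three consistency measures, so the three functions below (majority agreement, normalized entropy, and the gap between the two most frequent answers) are illustrative assumptions rather than the paper's definitions, and all function names are hypothetical. Each takes the final answers extracted from several sampled generations for one question and returns a score in [0, 1].

```python
import math
from collections import Counter


def agreement(answers):
    """Fraction of samples that match the most frequent answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def entropy_confidence(answers):
    """One minus the entropy of the answer distribution.

    Assumption: entropy is normalized by log(number of distinct answers),
    so a unanimous sample scores 1.0 and an even split scores 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    if len(counts) == 1:
        return 1.0
    n = len(answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


def first_second_distance(answers):
    """Gap between the shares of the top two answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top = Counter(answers).most_common(2)
    first = top[0][1]
    second = top[1][1] if len(top) > 1 else 0
    return (first - second) / len(answers)


# Hypothetical final answers from ten sampled generations for one question.
samples = ["42", "42", "42", "17", "42", "8", "42", "17", "42", "42"]
print(agreement(samples))
print(entropy_confidence(samples))
print(first_second_distance(samples))
```

In a setup like this, the majority answer would serve as the prediction and one of the scores as its confidence; which score calibrates best is the kind of question the paper's guidance on metric choice addresses.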

Cite

Text

Lyu et al. "Calibrating Large Language Models with Sample Consistency." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I18.34120

Markdown

[Lyu et al. "Calibrating Large Language Models with Sample Consistency." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/lyu2025aaai-calibrating/) doi:10.1609/AAAI.V39I18.34120

BibTeX

@inproceedings{lyu2025aaai-calibrating,
  title     = {{Calibrating Large Language Models with Sample Consistency}},
  author    = {Lyu, Qing and Shridhar, Kumar and Malaviya, Chaitanya and Zhang, Li and Elazar, Yanai and Tandon, Niket and Apidianaki, Marianna and Sachan, Mrinmaya and Callison-Burch, Chris},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {19260-19268},
  doi       = {10.1609/AAAI.V39I18.34120},
  url       = {https://mlanthology.org/aaai/2025/lyu2025aaai-calibrating/}
}