Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration

Abstract

Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, the latter demands costly, large-scale labeling. We introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. This design substantially reduces annotation requirements while improving generalization across tasks. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (~0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
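As a rough illustration of the self-consistency signal mentioned above, the following minimal sketch scores a question by how often sampled answers agree with the most frequent one. It is not the authors' implementation: the function names, the exact-match normalization, and the sample answers are all assumptions made for illustration, and a real system would likely judge agreement semantically rather than by string equality. The last line simply reproduces the annotation share quoted in the abstract (1k out of 560k training instances).

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def self_consistency_confidence(samples: list[str]) -> tuple[str, float]:
    """Return the most frequent normalized answer and its share of the samples.

    Assumption: agreement is exact match after normalization. This is a
    stand-in for whatever consistency check the paper actually uses.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Hypothetical sampled answers for a single question (illustrative only).
samples = ["Paris", "paris", "Paris ", "Lyon", "Paris"]
answer, confidence = self_consistency_confidence(samples)

# Share of correctness annotations in the 1k-label setting from the abstract.
annotation_fraction = 1_000 / 560_000
```

A score like this needs no correctness labels, which is why it can serve as inexpensive supervision for the elicitation stage; the calibration stage would then adjust the elicited confidence using the small annotated set.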

Cite

Text

Ni et al. "Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration." International Conference on Learning Representations, 2026.

Markdown

[Ni et al. "Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ni2026iclr-annotationefficient/)

BibTeX

@inproceedings{ni2026iclr-annotationefficient,
  title     = {{Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration}},
  author    = {Ni, Shiyu and Bi, Keping and Guo, Jiafeng and Tang, Minghao and Wu, Jingtong and Han, Zengxin and Cheng, Xueqi},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ni2026iclr-annotationefficient/}
}