Explaining Length Bias in LLM-Based Preference Evaluations

Abstract

The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but it reveals a notable bias toward longer responses that undermines the reliability of such evaluations. To better understand this bias, we propose decomposing the preference evaluation metric, specifically the *win rate*, into two key components: *desirability* and *information mass*. The former is length-independent and related to trustworthiness, covering aspects such as correctness, toxicity, and consistency; the latter is length-dependent and represents the amount of information in the response. We empirically validate this decomposition through controlled experiments and find that response length affects evaluations by influencing information mass. To obtain a reliable evaluation metric that assesses content quality without being confounded by response length, we propose **AdapAlpaca**, a simple yet effective adjustment to win rate measurement. Specifically, **AdapAlpaca** ensures a fair comparison of response quality by aligning the lengths of reference and test model responses within equivalent length intervals.
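To make the length-interval matching concrete, below is a minimal Python sketch of how a length-adjusted win rate could be computed: each test response is judged only against reference responses that fall in the same length interval, and wins are aggregated over those matched comparisons. The bucket width, the `judge_prefers` callable, and the assumption of multiple candidate references per prompt are illustrative choices, not the paper's released implementation.

```python
from collections import defaultdict
from typing import Callable

def length_bucket(text: str, width: int = 100) -> int:
    """Assign a response to a length interval of `width` words (hypothetical binning)."""
    return len(text.split()) // width

def adjusted_win_rate(
    test_responses: dict[str, str],           # prompt -> test-model response
    reference_sets: dict[str, list[str]],     # prompt -> candidate reference responses
    judge_prefers: Callable[[str, str, str], bool],  # (prompt, test, ref) -> True if judge picks test
    width: int = 100,
) -> float:
    """Win rate in which each test response is compared only to references
    of comparable length, approximating length-interval-matched evaluation."""
    wins, total = 0, 0
    for prompt, test in test_responses.items():
        bucket = length_bucket(test, width)
        # Keep only references in the same length interval as the test response.
        matched = [r for r in reference_sets.get(prompt, []) if length_bucket(r, width) == bucket]
        if not matched:
            continue  # no length-matched reference available; skip this prompt
        for ref in matched:
            wins += judge_prefers(prompt, test, ref)
            total += 1
    return wins / total if total else 0.0
```

In this sketch the length confound is controlled at comparison time rather than by penalizing long responses after the fact, which mirrors the abstract's idea of comparing responses under equivalent length intervals.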

Cite

Text

Hu et al. "Explaining Length Bias in LLM-Based Preference Evaluations." ICLR 2025 Workshops: Data_Problems, 2025.

Markdown

[Hu et al. "Explaining Length Bias in LLM-Based Preference Evaluations." ICLR 2025 Workshops: Data_Problems, 2025.](https://mlanthology.org/iclrw/2025/hu2025iclrw-explaining/)

BibTeX

@inproceedings{hu2025iclrw-explaining,
  title     = {{Explaining Length Bias in LLM-Based Preference Evaluations}},
  author    = {Hu, Zhengyu and Song, Linxin and Zhang, Jieyu and Xiao, Zheyuan and Chen, Zhengyu and Xiong, Hui},
  booktitle = {ICLR 2025 Workshops: Data_Problems},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/hu2025iclrw-explaining/}
}