Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Abstract

LLM-as-a-Judge has been widely adopted as an evaluation method in various benchmarks and used as a supervisory reward signal in model training. However, despite its effectiveness in many domains, its potential issues remain under-explored, undermining its reliability and the scope of its utility. We therefore identify 12 key potential biases and propose a new automated bias quantification framework—CALM—which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge using automated, principle-guided, model-based modification. Our experiments cover several popular language models, and the results indicate that while advanced models achieve commendable overall performance, significant biases persist on certain specific tasks. Empirical results suggest that the reliability of LLM-as-a-Judge still has room for improvement: even the best-performing models achieve a robustness rate of only 0.86. These findings highlight the need for stakeholders to address these issues and caution users when deploying LLM-as-a-Judge applications.
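
The abstract reports a "robustness rate," i.e., how often a judge's verdict survives a bias-inducing but content-preserving modification. Below is a minimal, hypothetical Python sketch of how such a metric could be computed for one bias type, assuming a pairwise judge interface; the function names and the perturbation are illustrative stand-ins, not the paper's CALM implementation.

```python
# Hypothetical sketch: estimating a judge's robustness rate against one bias type.
# The judge interface and perturbation function are illustrative assumptions,
# not the CALM framework's actual API.

from typing import Callable, List, Tuple

def robustness_rate(
    judge: Callable[[str, str, str], str],    # (question, answer_a, answer_b) -> "A" or "B"
    examples: List[Tuple[str, str, str]],     # (question, answer_a, answer_b) triples
    perturb: Callable[[str], str],            # bias-inducing, content-preserving modification
) -> float:
    """Fraction of judgments that stay the same after perturbing one answer."""
    consistent = 0
    for question, answer_a, answer_b in examples:
        original = judge(question, answer_a, answer_b)
        # A content-preserving change (e.g., verbosity padding, added authority
        # citations) should not flip the judge's verdict.
        perturbed = judge(question, answer_a, perturb(answer_b))
        if original == perturbed:
            consistent += 1
    return consistent / max(len(examples), 1)
```

In this reading, a robustness rate of 0.86 would mean that 14% of verdicts flip under a perturbation that should not change the preferred answer.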

Cite

Text

Ye et al. "Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge." NeurIPS 2024 Workshops: SafeGenAi, 2024.

Markdown

[Ye et al. "Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/ye2024neuripsw-justice/)

BibTeX

@inproceedings{ye2024neuripsw-justice,
  title     = {{Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge}},
  author    = {Ye, Jiayi and Wang, Yanbo and Huang, Yue and Chen, Dongping and Zhang, Qihui and Moniz, Nuno and Gao, Tian and Geyer, Werner and Huang, Chao and Chen, Pin-Yu and Chawla, Nitesh V and Zhang, Xiangliang},
  booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/ye2024neuripsw-justice/}
}