Investigating Non-Transitivity in LLM-as-a-Judge

Abstract

Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% $\rightarrow$ 96.4% and 82.1% $\rightarrow$ 86.3%, respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, which use a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
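To illustrate the idea behind the abstract's round-robin plus Bradley-Terry approach, here is a minimal sketch on toy data (the win counts and model names are hypothetical, not from the paper). It deliberately contains a non-transitive cycle — A beats B, B beats C, yet C beats A head-to-head — and fits Bradley-Terry strengths with the classic minorization-maximization (MM) updates, which still yield a single consistent ranking over all pairwise results:

```python
# Toy pairwise win counts from a hypothetical round-robin among three
# models. Note the non-transitive cycle: A beats B (6-4), B beats C
# (7-3), but C beats A (6-4). These numbers are illustrative only.
wins = {
    ("A", "B"): 6, ("B", "A"): 4,
    ("B", "C"): 7, ("C", "B"): 3,
    ("C", "A"): 6, ("A", "C"): 4,
}
models = ["A", "B", "C"]

# Bradley-Terry MM updates: p_i <- W_i / sum_j n_ij / (p_i + p_j),
# where W_i is model i's total wins and n_ij the games i played vs j,
# followed by renormalization so the strengths sum to 1.
p = {m: 1.0 for m in models}
for _ in range(200):
    new_p = {}
    for i in models:
        total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
            for j in models
            if j != i
        )
        new_p[i] = total_wins / denom
    norm = sum(new_p.values())
    p = {m: v / norm for m, v in new_p.items()}

# Despite the preference cycle, Bradley-Terry produces a total order.
ranking = sorted(models, key=p.get, reverse=True)
print(ranking, {m: round(p[m], 3) for m in models})
```

The key point is that a single fixed baseline would make each model's score depend on which opponent it happens to face, whereas aggregating all pairwise results through Bradley-Terry absorbs cyclic preferences into one global strength estimate.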

Cite

Text

Xu et al. "Investigating Non-Transitivity in LLM-as-a-Judge." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Xu et al. "Investigating Non-Transitivity in LLM-as-a-Judge." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/xu2025icml-investigating/)

BibTeX

@inproceedings{xu2025icml-investigating,
  title     = {{Investigating Non-Transitivity in LLM-as-a-Judge}},
  author    = {Xu, Yi and Ruis, Laura and Rocktäschel, Tim and Kirk, Robert},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {69583--69612},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/xu2025icml-investigating/}
}