TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

Wang, Yidong; Song, Yunze; Zhu, Tingyuan; Zhang, Xuanwang; Yu, Zhuohao; Chen, Hao; Song, Chiyu; Wang, Qiufeng; Wu, Zhen; Dai, Xinyu; Zhang, Yue; Wang, Cunxiang; Ye, Wei; Zhang, Shikun

ICLR 2026

Abstract

The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains ($A > B > C > A$) and equivalence contradictions ($A = B = C \neq A$). We argue that these issues stem from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: (1) distribution-sensitive scoring, which computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and (2) likelihood-aware aggregation, which resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge’s components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations.
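
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the function names, the dictionary layout, and the simple summation used to combine the two presentation orders are all assumptions made for illustration. The perplexity-based variant mentioned in the abstract is not shown.

```python
from typing import Dict


def expected_score(rating_probs: Dict[int, float]) -> float:
    """Continuous score as the expectation over discrete rating probabilities.

    rating_probs maps each discrete rating (e.g. 1..5) to the probability the
    judge assigns to it. The layout is hypothetical, not taken from the paper.
    """
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("rating probabilities must have positive mass")
    # Renormalize in case the probabilities do not sum exactly to 1.
    return sum(rating * p for rating, p in rating_probs.items()) / total


def aggregate_preference(p_ab: Dict[str, float], p_ba: Dict[str, float]) -> str:
    """Combine pairwise judgments from both presentation orders.

    p_ab: probabilities over {"first", "tie", "second"} with A shown first.
    p_ba: the same probabilities with B shown first.
    Summing the two orders is an assumed aggregation rule for this sketch.
    """
    combined = {
        "A": p_ab["first"] + p_ba["second"],
        "tie": p_ab["tie"] + p_ba["tie"],
        "B": p_ab["second"] + p_ba["first"],
    }
    # Pick the outcome with the most combined probability mass.
    return max(combined, key=combined.get)


# Two responses share the same most likely rating (4), so a discrete judge
# would tie them, yet their rating distributions differ.
response_x = {3: 0.10, 4: 0.60, 5: 0.30}
response_y = {3: 0.35, 4: 0.60, 5: 0.05}
score_x = expected_score(response_x)
score_y = expected_score(response_y)

# Pairwise verdict aggregated over both presentation orders.
verdict = aggregate_preference(
    {"first": 0.5, "tie": 0.3, "second": 0.2},
    {"first": 0.3, "tie": 0.3, "second": 0.4},
)
```

In this toy input both responses would receive a discrete rating of 4, while the expectation separates them, which is the kind of information loss the abstract attributes to discrete rating systems. Aggregating over both orders likewise gives a single verdict instead of two potentially conflicting ones.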

Cite

Text

Wang et al. "TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-trustjudge/)

BibTeX

@inproceedings{wang2026iclr-trustjudge,
  title     = {{TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them}},
  author    = {Wang, Yidong and Song, Yunze and Zhu, Tingyuan and Zhang, Xuanwang and Yu, Zhuohao and Chen, Hao and Song, Chiyu and Wang, Qiufeng and Wu, Zhen and Dai, Xinyu and Zhang, Yue and Wang, Cunxiang and Ye, Wei and Zhang, Shikun},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-trustjudge/}
}