Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Abstract

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves new state-of-the-art performance for generative reward models on RewardBench and PPE, despite being trained on fewer preference pairs, all of them synthetically generated. Additional experiments on other benchmarks such as RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
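
As a rough illustration of the plan, execution, and judgment structure described above, the sketch below shows one way sampled traces could be turned into preference pairs for a self-training round. The `Trace` class, the `build_preference_pairs` helper, and the rule that a trace with a correct verdict is "chosen" over one with an incorrect verdict are assumptions made for illustration; the abstract does not specify how pairs are constructed, and the actual procedure may differ.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trace:
    """One sampled judge output (hypothetical structure)."""
    plan: str       # unconstrained evaluation plan
    execution: str  # step-by-step execution of that plan
    verdict: str    # final judgment, e.g. "A" or "B"


def build_preference_pairs(traces, gold_verdict):
    """Pair each trace whose verdict matches the known label (chosen)
    with each trace whose verdict does not (rejected)."""
    chosen = [t for t in traces if t.verdict == gold_verdict]
    rejected = [t for t in traces if t.verdict != gold_verdict]
    return list(product(chosen, rejected))


# Toy samples for one instruction with two candidate responses, A and B.
traces = [
    Trace("check factual accuracy", "A is accurate, B is not", "A"),
    Trace("compare length", "B is longer", "B"),
    Trace("check instruction following", "A follows all constraints", "A"),
]
pairs = build_preference_pairs(traces, gold_verdict="A")
```

Under these assumptions, the resulting (chosen, rejected) pairs would be passed to a preference optimization step, and the updated judge would sample new traces in the next iteration.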

Cite

Text

Saha et al. "Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Saha et al. "Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/saha2025icml-learning/)

BibTeX

@inproceedings{saha2025icml-learning,
  title     = {{Learning to Plan \& Reason for Evaluation with Thinking-LLM-as-a-Judge}},
  author    = {Saha, Swarnadeep and Li, Xian and Ghazvininejad, Marjan and Weston, Jason E and Wang, Tianlu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {52565--52583},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/saha2025icml-learning/}
}