Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Abstract
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench and PPE, despite being trained on fewer, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
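The three-stage judgment pipeline described in the abstract (plan, then execution, then verdict) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `generate` function stands in for a call to the judge LLM (here it returns canned text), the `PLAN:`/`EXECUTE:`/`VERDICT:` prompt prefixes are invented for the mock, and the preference-optimization self-training loop over sampled plans and executions is not shown.

```python
# Hypothetical sketch of the plan -> execute -> judge pipeline from the
# abstract. All names and prompts are illustrative assumptions.

def generate(prompt: str) -> str:
    """Stand-in for the judge LLM; dispatches on a mock prompt prefix."""
    if prompt.startswith("PLAN:"):
        return "1. Check factual accuracy. 2. Check instruction following."
    if prompt.startswith("EXECUTE:"):
        return "Response A satisfies both criteria better."
    return "A"

def judge(instruction: str, response_a: str, response_b: str) -> str:
    # Stage 1: an unconstrained evaluation plan, conditioned only on the
    # instruction (not on the candidate responses).
    plan = generate(f"PLAN: draft an evaluation plan for: {instruction}")
    # Stage 2: execute the plan step by step against both responses.
    execution = generate(
        f"EXECUTE: {plan}\nA: {response_a}\nB: {response_b}"
    )
    # Stage 3: a final verdict derived from the executed reasoning.
    return generate(f"VERDICT: given the reasoning: {execution}, answer A or B.")

print(judge("Summarize the article.", "Summary A", "Summary B"))  # prints "A"
```

Keeping the plan conditioned only on the instruction is what separates planning from the evaluation reasoning itself, which the abstract identifies as a key difference from prior judges.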
Cite
Text
Saha et al. "Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Saha et al. "Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/saha2025icml-learning/)
BibTeX
@inproceedings{saha2025icml-learning,
title = {{Learning to Plan \& Reason for Evaluation with Thinking-LLM-as-a-Judge}},
author = {Saha, Swarnadeep and Li, Xian and Ghazvininejad, Marjan and Weston, Jason E and Wang, Tianlu},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {52565--52583},
volume = {267},
url = {https://mlanthology.org/icml/2025/saha2025icml-learning/}
}