Process Reward Models That Think
Abstract
Step-by-step verifiers—also known as process reward models (PRMs)—are a key ingredient for test-time scaling, but training them requires expensive step-level supervision. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers—using only 1% of the process labels in PRM800K—across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation over subsets of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained with the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. This work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.
Cite
Text
Khalifa et al. "Process Reward Models That Think." Transactions on Machine Learning Research, 2026.
Markdown
[Khalifa et al. "Process Reward Models That Think." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/khalifa2026tmlr-process/)
BibTeX
@article{khalifa2026tmlr-process,
title = {{Process Reward Models That Think}},
author = {Khalifa, Muhammad and Agarwal, Rishabh and Logeswaran, Lajanugen and Kim, Jaekyeom and Peng, Hao and Lee, Moontae and Lee, Honglak and Wang, Lu},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/khalifa2026tmlr-process/}
}