Process Reward Models That Think

Khalifa, Muhammad; Agarwal, Rishabh; Logeswaran, Lajanugen; Kim, Jaekyeom; Peng, Hao; Lee, Moontae; Lee, Honglak; Wang, Lu

TMLR 2026

Abstract

Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling, but training them requires expensive step-level supervision. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models and, using only 1% of the process labels in PRM800K, outperforms LLM-as-a-Judge and discriminative verifiers across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation over subsets of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained with the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. This work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.
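As a rough illustration of the best-of-N selection mentioned in the abstract, the sketch below picks the candidate solution whose steps a verifier rates most highly. It is a generic sketch and not the paper's implementation: `toy_score_steps` is a hypothetical stand-in for a step-level verifier, and aggregating step scores with `min` is an assumption (one common choice), not something the abstract specifies.

```python
from typing import Callable, List, Sequence

Steps = List[str]


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one solution score.

    Assumption: a solution is only as good as its weakest step, so we take
    the minimum. Other aggregations (product, last step) are also plausible.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one step")
    return min(step_scores)


def best_of_n(
    candidates: Sequence[Steps],
    score_steps: Callable[[Steps], List[float]],
) -> Steps:
    """Return the candidate with the highest aggregated verifier score."""
    return max(candidates, key=lambda steps: solution_score(score_steps(steps)))


def toy_score_steps(steps: Steps) -> List[float]:
    """Hypothetical verifier: flags steps containing the marker 'ERROR'.

    A real process reward model would produce these scores itself.
    """
    return [0.1 if "ERROR" in step else 0.9 for step in steps]


candidates = [
    ["2 + 2 = 4", "4 * 3 = 12"],
    ["2 + 2 = 5 ERROR", "5 * 3 = 15"],
]
best = best_of_n(candidates, toy_score_steps)
```

Here `best` is the first candidate, since its lowest step score (0.9) exceeds that of the second (0.1).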

Cite

Text

Khalifa et al. "Process Reward Models That Think." Transactions on Machine Learning Research, 2026.

Markdown

[Khalifa et al. "Process Reward Models That Think." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/khalifa2026tmlr-process/)

BibTeX

@article{khalifa2026tmlr-process,
  title     = {{Process Reward Models That Think}},
  author    = {Khalifa, Muhammad and Agarwal, Rishabh and Logeswaran, Lajanugen and Kim, Jaekyeom and Peng, Hao and Lee, Moontae and Lee, Honglak and Wang, Lu},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/khalifa2026tmlr-process/}
}