Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Abstract

Process reward models (PRMs) have proven effective for test-time scaling of LLMs on challenging reasoning tasks. However, the reward hacking induced by PRMs hinders their successful application in reinforcement fine-tuning. We find that the primary cause of this reward hacking is the canonical summation-form credit assignment in reinforcement learning (RL), i.e., cumulative gamma-decayed future rewards, which causes the LLM to hack steps with high rewards. Therefore, to unleash the power of PRMs at training time, we propose PURE: Process sUpervised Reinforcement lEarning. The core of PURE is min-form credit assignment, which defines the value function as the minimum of future rewards. This method unifies the optimization objective with respect to process rewards across test time and training time, and significantly alleviates reward hacking because it bounds the range of the value function and assigns advantages more rationally. Through extensive experiments on 3 base models, we show that the PRM-based approach achieves reasoning performance comparable to the verifiable-reward-based approach once min-form credit assignment is enabled. In contrast, the canonical sum-form credit assignment collapses training from the very beginning. Moreover, when we incorporate verifiable rewards on 1/10th of the data to assist PRM-based fine-tuning, reward hacking is further alleviated and we obtain the best fine-tuned model based on Qwen2.5-Math-7B, with 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Furthermore, we summarize the reward hacking cases we encountered during training and analyze the cause of training collapse.
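The contrast between the two credit-assignment schemes can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: it assumes a trajectory represented as a list of per-step process rewards, and the gamma value and function names are hypothetical. Under the sum form, early steps accumulate credit from later high-reward steps, so a single erroneous low-reward step is drowned out; under the min form, every step preceding that error is capped by it.

```python
# Illustrative sketch of sum-form vs. min-form credit assignment over
# per-step process rewards. Names and gamma are assumptions for this example.

def sum_form_values(rewards, gamma=0.99):
    """Canonical RL value: V_t = sum over k >= t of gamma^(k-t) * r_k."""
    values = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running  # discounted backward accumulation
        values.append(running)
    return values[::-1]

def min_form_values(rewards):
    """PURE-style value: V_t = min over k >= t of r_k."""
    values = []
    running = float("inf")
    for r in reversed(rewards):
        running = min(running, r)  # backward running minimum
        values.append(running)
    return values[::-1]

# One low-reward (likely erroneous) step amid otherwise high rewards:
rewards = [0.9, 0.95, 0.2, 0.9]
print(sum_form_values(rewards))  # early steps still look attractive
print(min_form_values(rewards))  # -> [0.2, 0.2, 0.2, 0.9]
```

In the min form, the values are bounded by the reward range itself and the weak step dominates every prefix, which matches how a test-time PRM scorer judges a solution by its worst step.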

Cite

Text

Cheng et al. "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Cheng et al. "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/cheng2025neurips-stop/)

BibTeX

@inproceedings{cheng2025neurips-stop,
  title     = {{Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning}},
  author    = {Cheng, Jie and Xiong, Gang and Qiao, Ruixi and Li, Lijun and Guo, Chao and Wang, Junle and Lv, Yisheng and Wang, Fei-Yue},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/cheng2025neurips-stop/}
}