Reasoning Is Not a Race: When Stopping Early Beats Going Deeper

Abstract

We study the use of Process Reward Models (PRMs) for guiding Long Chain-of-Thought (CoT) reasoning in large language models. Although PRMs deliver fine-grained feedback on standard tasks, PRM-guided beam search does not consistently outperform PRM-free approaches in long CoT reasoning. We trace this shortfall to a "step quality degradation": the expected step quality exhibits concave behavior over search depth, yielding unimodal or monotonically declining trends. To counteract this, we propose Z-Score Guided Early Stopping (ZGES), which halts the search at the detected quality peak using local z-scores of PRM rewards. Across multiple math benchmarks and model scales, ZGES outperforms both standard PRM-guided beam search and PRM-free methods. Ablation studies further highlight the advantages and robustness of ZGES's adaptive stopping mechanism.
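As a rough illustration of the idea, a z-score stopping rule of this kind could be sketched as below. The window size, threshold, and all function names are assumptions made for the sketch, not details of the paper's implementation.

import statistics

def zges_should_stop(step_rewards, window=5, z_threshold=-1.0):
    # Sketch only: decide whether the newest step's PRM reward sits far
    # enough below the local trend to suggest quality has passed its peak.
    # `window` and `z_threshold` are hypothetical defaults.
    if len(step_rewards) <= window:
        return False  # not enough history for a local statistic
    baseline = step_rewards[-(window + 1):-1]  # the `window` steps before the newest
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    if std == 0:
        return False  # flat reward history gives no evidence of a decline
    z = (step_rewards[-1] - mean) / std
    return z < z_threshold

# Hypothetical usage inside a step-by-step search loop, where
# `generate_next_step` and `prm_score` stand in for the model's
# step generator and the PRM scorer:
#
# rewards = []
# while not done:
#     step = generate_next_step()
#     rewards.append(prm_score(step))
#     if zges_should_stop(rewards):
#         break  # halt near the detected quality peak

Under these assumptions, the rule stops the search once the most recent step's reward falls well below the running local mean, rather than searching to a fixed depth.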

Cite

Text

Zhang et al. "Reasoning Is Not a Race: When Stopping Early Beats Going Deeper." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhang et al. "Reasoning Is Not a Race: When Stopping Early Beats Going Deeper." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-reasoning/)

BibTeX

@inproceedings{zhang2025neurips-reasoning,
  title     = {{Reasoning Is Not a Race: When Stopping Early Beats Going Deeper}},
  author    = {Zhang, Mohan and Gao, Jiaxuan and Xu, Shusheng and Wu, Yi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhang2025neurips-reasoning/}
}