Value-Guided Search for Efficient Chain-of-Thought Reasoning

Abstract

In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Unlike existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to match the performance of majority voting. Our dataset, model, and codebase are open-sourced at \codeurl.
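The block-wise search described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `expand` and `value_model` are hypothetical stand-ins for an LLM block sampler and the trained 1.5B token-level value model, and the "answers" are single tokens.

```python
from collections import defaultdict

def expand(trace):
    """Extend a partial reasoning trace with each candidate block (toy: 'A' or 'B')."""
    return [trace + [tok] for tok in ("A", "B")]

def value_model(trace):
    """Toy value in [0, 1]: fraction of 'A' blocks in the trace."""
    return trace.count("A") / len(trace)

def vgs(beam_width=4, num_blocks=3):
    """Block-wise value-guided search with a final weighted majority vote."""
    beams = [[] for _ in range(beam_width)]
    for _ in range(num_blocks):
        # Expand every beam block-wise, then keep the top-scoring
        # candidates under the value model.
        candidates = [c for b in beams for c in expand(b)]
        candidates.sort(key=value_model, reverse=True)
        beams = candidates[:beam_width]
    # Weighted majority vote: each finished trace votes for its final
    # answer, weighted by its value score.
    votes = defaultdict(float)
    for b in beams:
        votes[b[-1]] += value_model(b)
    return max(votes, key=votes.get)
```

Because pruning happens at the block level, the value model only needs to score prefixes of whole traces, which is what lets the method avoid a fine-grained definition of "step."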

Cite

Text

Wang et al. "Value-Guided Search for Efficient Chain-of-Thought Reasoning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "Value-Guided Search for Efficient Chain-of-Thought Reasoning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-valueguided/)

BibTeX

@inproceedings{wang2025neurips-valueguided,
  title     = {{Value-Guided Search for Efficient Chain-of-Thought Reasoning}},
  author    = {Wang, Kaiwen and Zhou, Jin Peng and Chang, Jonathan Daniel and Gao, Zhaolin and Kallus, Nathan and Brantley, Kianté and Sun, Wen},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-valueguided/}
}