Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Abstract

Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), which adapts this idea to LLMs: a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and can operate with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM matches or surpasses policy-based baselines such as PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs. The codebase for TBRM is publicly available at [https://github.com/rlx-lab/TBRM](https://github.com/rlx-lab/TBRM).
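To make the phrase "trajectory-level Bellman objective" concrete, the following is a minimal sketch of one standard way to write such a residual under KL-regularized (soft) Bellman equations. It is an illustration under stated assumptions, not the objective from the paper or the linked codebase: the soft value $V(s) = \beta \log \sum_a \pi_{\text{ref}}(a \mid s) \exp(Q(s,a)/\beta)$, the zero value after the final step, and all function and argument names (`soft_value`, `trajectory_bellman_residual`, `q_rows`, `ref_logprob_rows`) are hypothetical choices made here.

```python
import math

def soft_value(q_row, ref_logprobs, beta):
    """Assumed soft value: beta * log sum_a pi_ref(a|s) * exp(Q(s,a) / beta).

    Computed with the usual max-shift for numerical stability.
    """
    terms = [lp + q / beta for q, lp in zip(q_row, ref_logprobs)]
    m = max(terms)
    return beta * (m + math.log(sum(math.exp(t - m) for t in terms)))

def trajectory_bellman_residual(q_rows, ref_logprob_rows, actions, rewards, beta):
    """Sum over steps of Q(s_h, a_h) - r_h - V(s_{h+1}).

    q_rows[h] holds the per-action Q-values (e.g. logits) at step h.
    The value after the final step is taken to be 0 (an assumption).
    """
    horizon = len(actions)
    residual = 0.0
    for h in range(horizon):
        q_taken = q_rows[h][actions[h]]
        if h + 1 < horizon:
            next_v = soft_value(q_rows[h + 1], ref_logprob_rows[h + 1], beta)
        else:
            next_v = 0.0
        residual += q_taken - rewards[h] - next_v
    return residual

def trajectory_loss(q_rows, ref_logprob_rows, actions, rewards, beta):
    """Squared trajectory-level residual for a single rollout."""
    r = trajectory_bellman_residual(q_rows, ref_logprob_rows, actions, rewards, beta)
    return r * r
```

In this sketch the loss depends only on the logged trajectory and the current Q-values, so no importance-sampling ratio or separate critic appears, and a single rollout per prompt is enough to evaluate it. The residual is zero whenever the Q-values satisfy the assumed soft Bellman equations along the trajectory.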

Cite

Text

Yuan et al. "Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning." Advances in Neural Information Processing Systems, 2025.

BibTeX

@inproceedings{yuan2025neurips-trajectory,
  title     = {{Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning}},
  author    = {Yuan, Yurun and Chen, Fan and Jia, Zeyu and Rakhlin, Alexander and Xie, Tengyang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/yuan2025neurips-trajectory/}
}