Trust, but Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
Abstract
Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism leverages verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, and both trajectories contribute to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages establish RISE as a flexible and effective path towards developing more robust and self-aware reasoners.
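The core mechanism described above can be sketched in a few lines of Python. This is a toy illustration only, not the paper's implementation: the `policy.solve`/`policy.verify` interface, the binary reward shapes, and the `"correct"`/`"incorrect"` verdict format are all assumptions made for the sketch.

```python
# Toy sketch of one RISE-style iteration: the policy generates a solution,
# then critiques its own on-policy output, and both trajectories receive
# verifiable rewards from an outcome verifier. (Interface and reward
# shapes are illustrative assumptions, not the paper's implementation.)

def outcome_reward(answer, ground_truth):
    """Outcome verifier: reward 1 if the proposed answer is correct."""
    return 1.0 if answer == ground_truth else 0.0

def verification_reward(verdict, answer, ground_truth):
    """Self-verification reward: 1 if the model's verdict agrees with
    the outcome verifier's judgment of the answer."""
    is_correct = answer == ground_truth
    return 1.0 if (verdict == "correct") == is_correct else 0.0

def rise_iteration(policy, problem, ground_truth):
    """Generate a solution, self-verify it, and return both
    (trajectory, reward) pairs for the policy update."""
    answer = policy.solve(problem)            # on-policy solution generation
    verdict = policy.verify(problem, answer)  # critique of own solution
    r_gen = outcome_reward(answer, ground_truth)
    r_ver = verification_reward(verdict, answer, ground_truth)
    return [(("solve", problem, answer), r_gen),
            (("verify", problem, answer, verdict), r_ver)]

class DummyPolicy:
    """Stand-in policy for demonstration: always answers '4' and
    always claims its answer is correct."""
    def solve(self, problem):
        return "4"
    def verify(self, problem, answer):
        return "correct"
```

In an actual training loop, the returned trajectory-reward pairs would feed a policy-gradient update (e.g. PPO/GRPO-style), so that a single update step reinforces both abilities at once.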
Cite
Text
Liu et al. "Trust, but Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards." Advances in Neural Information Processing Systems, 2025.
Markdown
[Liu et al. "Trust, but Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/liu2025neurips-trust/)
BibTeX
@inproceedings{liu2025neurips-trust,
  title = {{Trust, but Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards}},
  author = {Liu, Xiaoyuan and Liang, Tian and He, Zhiwei and Xu, Jiahao and Wang, Wenxuan and He, Pinjia and Tu, Zhaopeng and Mi, Haitao and Yu, Dong},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/liu2025neurips-trust/}
}