Learning to Reason Under Off-Policy Guidance
Abstract
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR fails completely. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrate the great potential of utilizing off-policy guidance in RLVR.
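The two ingredients named in the abstract can be sketched in a few lines: GRPO-style group-normalized advantages shared across on- and off-policy rollouts, and a regularized importance weight that damps the gradient on tokens the policy already imitates well. This is a minimal illustrative sketch, not the paper's implementation: the shaping function `f(p) = p / (p + gamma)`, the `gamma` value, and the surrogate loss form are assumptions for illustration.

```python
import math

def group_advantages(rewards):
    # GRPO-style advantage: normalize verifiable rewards within one
    # group of rollouts (on-policy and off-policy pooled together).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def shaped_weight(p, gamma=0.1):
    # Assumed shaping function f(p) = p / (p + gamma): a regularized
    # importance weight that keeps low-probability (hard-to-imitate)
    # off-policy tokens influential instead of letting high-probability
    # tokens dominate the update.
    return p / (p + gamma)

def mixed_policy_loss(on_policy, off_policy, gamma=0.1):
    # on_policy / off_policy: lists of (token_probs, reward) per rollout,
    # where token_probs are the current policy's probabilities of the
    # rollout's tokens. Advantages are computed over the combined group.
    rewards = [r for _, r in on_policy] + [r for _, r in off_policy]
    advs = group_advantages(rewards)
    total, n_tok = 0.0, 0
    # On-policy rollouts: plain policy-gradient surrogate.
    for (probs, _), adv in zip(on_policy, advs[: len(on_policy)]):
        for p in probs:
            total += -adv * math.log(p)
            n_tok += 1
    # Off-policy demonstrations: same surrogate, scaled by the
    # regularized importance weight (policy shaping).
    for (probs, _), adv in zip(off_policy, advs[len(on_policy):]):
        for p in probs:
            total += -adv * shaped_weight(p, gamma) * math.log(p)
            n_tok += 1
    return total / max(n_tok, 1)
```

Pooling both rollout types into one advantage group is what makes the scheme "mixed-policy": a correct off-policy trace earns a positive advantage relative to failed on-policy attempts, so the model is pulled toward it, while the shaped weight prevents rote token-level imitation.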
Cite
Text
Yan et al. "Learning to Reason Under Off-Policy Guidance." Advances in Neural Information Processing Systems, 2025.

Markdown
[Yan et al. "Learning to Reason Under Off-Policy Guidance." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yan2025neurips-learning/)

BibTeX
@inproceedings{yan2025neurips-learning,
  title     = {{Learning to Reason Under Off-Policy Guidance}},
  author    = {Yan, Jianhao and Li, Yafu and Hu, Zican and Wang, Zhi and Cui, Ganqu and Qu, Xiaoye and Cheng, Yu and Zhang, Yue},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/yan2025neurips-learning/}
}