Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Abstract

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. BAPO dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while retaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO resolves 40.7% of the problems that base models consistently fail to solve.
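
The reward homogeneity issue mentioned above can be illustrated with the group-normalized advantage commonly used in GRPO-style training, where each sampled response is scored relative to the mean and standard deviation of its group. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function name `group_relative_advantages`, the `eps` constant, and the binary reward values are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation.

    Assumed form for illustration: (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages (a learning signal).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response fails yields all-zero advantages,
# so the sampled responses contribute no gradient signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_fail)
```

Under this assumed normalization, a difficult prompt whose sampled responses all receive the same reward contributes nothing to the update, which is one way to read the abstract's point that homogeneous rewards waste experience on hard samples.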

Cite

Text

Wan et al. "Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Wan et al. "Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wan2026iclr-buffer/)

BibTeX

@inproceedings{wan2026iclr-buffer,
  title     = {{Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning}},
  author    = {Wan, Xu and Wang, Yansheng and Huang, Wenqi and Sun, Mingyang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wan2026iclr-buffer/}
}