Not All Rollouts Are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion—max-variance down-sampling—that maximizes the variance of reward in the selected subset, and provide an efficient $O(n\log n)$ implementation of this rule. Empirically, Group Relative Policy Optimization (GRPO) coupled with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
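The max-variance down-sampling rule described above can be illustrated with a minimal sketch. The abstract does not spell out the algorithm, so the sketch relies on an assumption: that a maximum-variance subset of size m can always be formed from the k highest and m - k lowest rewards for some k. Under that assumption, sorting once and scanning all k with prefix sums gives an $O(n\log n)$ procedure. The function name `max_variance_downsample` and its signature are hypothetical, not taken from the paper.

```python
from itertools import accumulate


def max_variance_downsample(rewards, m):
    """Return indices of m rollouts whose rewards have maximal variance.

    Illustrative sketch only. Assumes an optimal subset consists of the
    k largest and (m - k) smallest rewards for some k in 0..m.
    """
    n = len(rewards)
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= len(rewards)")

    # Sort rollout indices by reward, ascending: O(n log n).
    order = sorted(range(n), key=lambda i: rewards[i])
    sorted_r = [rewards[i] for i in order]

    # Prefix sums of rewards and squared rewards give O(1) range sums.
    s1 = [0.0] + list(accumulate(sorted_r))
    s2 = [0.0] + list(accumulate(r * r for r in sorted_r))

    best_var, best_k = -1.0, 0
    for k in range(m + 1):  # k = number of rollouts taken from the top
        low = m - k         # number of rollouts taken from the bottom
        total = s1[low] + (s1[n] - s1[n - k])
        total_sq = s2[low] + (s2[n] - s2[n - k])
        mean = total / m
        var = total_sq / m - mean * mean  # population variance of the subset
        if var > best_var:
            best_var, best_k = var, k

    low = m - best_k
    return order[:low] + order[n - best_k:]


# Example: 8 rollouts with binary rewards, keep 4 for the policy update.
rewards = [0, 0, 0, 1, 1, 0, 0, 0]
selected = max_variance_downsample(rewards, 4)
selected_rewards = sorted(rewards[i] for i in selected)
```

In this example the selected subset contains both correct rollouts and two incorrect ones, so the policy update sees a balanced contrast in rewards even though most of the generated rollouts failed.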

Cite

Text

Xu et al. "Not All Rollouts Are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning." Transactions on Machine Learning Research, 2026.

Markdown

[Xu et al. "Not All Rollouts Are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/xu2026tmlr-all/)

BibTeX

@article{xu2026tmlr-all,
  title     = {{Not All Rollouts Are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning}},
  author    = {Xu, Yixuan Even and Savani, Yash and Fang, Fei and Kolter, J. Zico},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/xu2026tmlr-all/}
}