Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO/DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), which replaces the mean with a group-wise $K$-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries ($p \le 1{-}K$) it reinforces rare successes, while on easy queries ($p > 1{-}K$) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on the one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned $K$, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
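The two-regime gate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary (0/1) verifiable rewards, an inverse-CDF quantile with no interpolation, and unnormalized advantages. The function names `quantile_baseline` and `quantile_advantages` are hypothetical.

```python
import math


def quantile_baseline(rewards, k):
    """Return the k-quantile of one group's rewards (inverse CDF, no interpolation).

    Assumption: the quantile is the smallest reward r whose empirical CDF
    satisfies F(r) >= k. The abstract does not specify this convention.
    """
    if not 0.0 < k <= 1.0:
        raise ValueError("k must lie in (0, 1]")
    ordered = sorted(rewards)
    index = max(math.ceil(k * len(ordered)) - 1, 0)
    return ordered[index]


def quantile_advantages(rewards, k):
    """Advantage of each response relative to the group's k-quantile baseline.

    Assumption: no standard-deviation normalization is applied here.
    """
    baseline = quantile_baseline(rewards, k)
    return [r - baseline for r in rewards]


# Hard query: success rate p = 2/8 <= 1 - K = 0.6, so the baseline is 0.
hard_group = [0, 0, 0, 0, 0, 0, 1, 1]
# Easy query: success rate p = 7/8 > 1 - K = 0.6, so the baseline is 1.
easy_group = [1, 1, 1, 1, 1, 1, 1, 0]

hard_adv = quantile_advantages(hard_group, k=0.4)
easy_adv = quantile_advantages(easy_group, k=0.4)
```

Under these assumptions, the hard group gives a positive advantage only to its two successes, and the easy group gives a negative advantage only to its single failure. Every other response receives zero advantage, which is the sparsification the abstract refers to.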
Cite
Text
Wu et al. "Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning." International Conference on Learning Representations, 2026.
Markdown
[Wu et al. "Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-quantile/)
BibTeX
@inproceedings{wu2026iclr-quantile,
title = {{Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning}},
author = {Wu, Junkang and Huang, Kexin and Wu, Jiancan and Zhang, An and Wang, Xiang and He, Xiangnan},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/wu2026iclr-quantile/}
}