Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning

Wu, Junkang; Huang, Kexin; Wu, Jiancan; Zhang, An; Wang, Xiang; He, Xiangnan

ICLR 2026

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO/DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), which replaces the mean with a group-wise $K$-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries ($p \le 1{-}K$) it reinforces rare successes, while on easy queries ($p > 1{-}K$) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on the one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned $K$, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
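
The two-regime gate described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the abstract: rewards are binary (0 for failure, 1 for success), $p$ is the fraction of successes within a group, the $K$-quantile is taken as the lower empirical quantile (the smallest reward whose empirical CDF reaches $K$), and no standard-deviation normalization is applied. The function names and the value $K = 0.4$ are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def quantile_baseline(rewards, k):
    """Lower empirical k-quantile: smallest reward r whose ECDF(r) >= k.

    Assumption: this quantile convention is one reasonable choice, not
    necessarily the one used in the paper.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    ordered = sorted(rewards)
    # Small tolerance guards against float round-up when k * n is an integer.
    idx = max(math.ceil(k * len(ordered) - 1e-9) - 1, 0)
    return ordered[idx]


def quantile_advantages(rewards, k):
    """Advantage of each response relative to the group's k-quantile baseline."""
    baseline = quantile_baseline(rewards, k)
    return [r - baseline for r in rewards]


K = 0.4  # hypothetical value, for illustration only

# Hard query: p = 1/8 <= 1 - K, so the baseline is 0 and only the
# rare success receives a nonzero (positive) advantage.
hard = [0, 0, 0, 0, 0, 0, 0, 1]
print(quantile_advantages(hard, K))  # [0, 0, 0, 0, 0, 0, 0, 1]

# Easy query: p = 7/8 > 1 - K, so the baseline is 1 and only the
# remaining failure receives a nonzero (negative) advantage.
easy = [1, 1, 1, 0, 1, 1, 1, 1]
print(quantile_advantages(easy, K))  # [0, 0, 0, -1, 0, 0, 0, 0]
```

Under these assumptions, every response on the majority side of a group gets exactly zero advantage, which is consistent with the sparsified credit assignment the abstract reports. A mean baseline would instead assign a nonzero advantage to every response in any group with mixed outcomes.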

Cite

Text

Wu et al. "Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-quantile/)

BibTeX

@inproceedings{wu2026iclr-quantile,
  title     = {{Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning}},
  author    = {Wu, Junkang and Huang, Kexin and Wu, Jiancan and Zhang, An and Wang, Xiang and He, Xiangnan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-quantile/}
}