$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

Zhou, Jin Peng; Wang, Kaiwen; Chang, Jonathan Daniel; Gao, Zhaolin; Kallus, Nathan; Weinberger, Kilian Q; Brantley, Kianté; Sun, Wen

NeurIPS 2025

Abstract

Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
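
To make the idea of "guiding the reference policy" with a regularized $Q$ function concrete, here is a minimal sketch of the standard closed form for KL-regularized RL, in which the guided policy is proportional to $\pi_{\text{ref}}(a \mid s)\exp(Q(s,a)/\eta)$. It is an illustration under stated assumptions, not the authors' implementation: the function names `guided_policy` and `kl_divergence`, the temperature `eta`, and the toy numbers are all hypothetical, and the $Q$-values are hand-picked placeholders rather than values learned with distributional RL.

```python
import math


def guided_policy(ref_probs, q_values, eta):
    """Reweight a reference distribution by exp(Q / eta) and renormalize.

    Hypothetical helper: ref_probs are next-token probabilities under the
    reference policy, q_values are placeholder regularized Q estimates.
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    # Work in log space; tokens with zero reference mass stay at zero.
    logits = [
        (math.log(p) if p > 0 else -math.inf) + q / eta
        for p, q in zip(ref_probs, q_values)
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: three candidate tokens, the second has the highest Q-value.
ref = [0.5, 0.3, 0.2]
q = [0.0, 1.0, 0.0]
strong = guided_policy(ref, q, eta=0.5)   # weaker regularization
weak = guided_policy(ref, q, eta=5.0)     # stronger regularization
```

In this toy setting, a smaller `eta` shifts more probability mass toward the high-$Q$ token and moves the guided policy further from the reference in KL, while a larger `eta` keeps it close to the reference.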

Cite

Text

Zhou et al. "$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhou et al. "$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhou2025neurips-provably/)

BibTeX

@inproceedings{zhou2025neurips-provably,
  title     = {{$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training}},
  author    = {Zhou, Jin Peng and Wang, Kaiwen and Chang, Jonathan Daniel and Gao, Zhaolin and Kallus, Nathan and Weinberger, Kilian Q and Brantley, Kianté and Sun, Wen},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhou2025neurips-provably/}
}