Q-Learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis

Abstract

In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are standard metrics for modeling RL agents’ preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations, and it serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.
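For context, below is a minimal sketch of the standard tabular Q-learning update that model-free algorithms of this kind build on. It optimizes expected return, not a quantile, and it is not the decomposition proposed in the paper; all names (q_learning, env.reset, env.step, n_states, n_actions) are hypothetical placeholders assumed for illustration.

# Minimal tabular Q-learning sketch (standard expected-return objective).
# NOT the paper's quantile decomposition; it only illustrates the generic
# model-free TD update that quantile variants extend.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))  # tabular action-value estimates
    for _ in range(episodes):
        s = env.reset()          # hypothetical env API: returns initial state index
        done = False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # hypothetical env API
            # temporal-difference update toward the sampled Bellman target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q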

Cite

Text

Hau et al. "Q-Learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.

Markdown

[Hau et al. "Q-Learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.](https://mlanthology.org/aistats/2025/hau2025aistats-qlearning/)

BibTeX

@inproceedings{hau2025aistats-qlearning,
  title     = {{Q-Learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis}},
  author    = {Hau, Jia Lin and Delage, Erick and Derman, Esther and Ghavamzadeh, Mohammad and Petrik, Marek},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  year      = {2025},
  pages     = {2665--2673},
  volume    = {258},
  url       = {https://mlanthology.org/aistats/2025/hau2025aistats-qlearning/}
}