Online Learning in Risk Sensitive Constrained MDP
Abstract
We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of $\tilde{\mathcal{O}}\big(V_{g,\max}K^{3/4} + \sqrt{H^4 S^2 A \log(1/\delta)}K^{3/4}\big)$ and constraint violation of $\tilde{\mathcal{O}}\big(V_{g,\max} \sqrt{ {H^3 S^2 A \log(1/\delta)}}K^{3/4} \big)$ with probability at least $1-\delta$, where $S$ and $A$ denote the cardinalities of the state and action spaces, respectively, $H$ is the episode length, $K$ is the number of episodes, $\alpha < 0$ is the risk-aversion parameter, and $V_{g,\max} = \frac{1}{|\alpha|}(\exp(|\alpha|H) - 1)$. To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.
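The abstract's reduction hinges on the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure: for risk parameter $\alpha < 0$, maximizing $b + \frac{1}{\alpha}\mathbb{E}[e^{\alpha(X-b)} - 1]$ over a scalar budget $b$ recovers $\frac{1}{\alpha}\log\mathbb{E}[e^{\alpha X}]$, which is what lets the paper trade the risk measure for an augmented continuous budget variable. A minimal numerical sketch of this identity (the distribution, $\alpha$, and grid below are illustrative choices, not from the paper):

```python
import numpy as np

# Illustrative discrete distribution of a total utility X and a
# risk-aversion parameter alpha < 0 (both chosen for the example).
alpha = -0.5
xs = np.array([0.0, 1.0, 2.0, 3.0])
ps = np.array([0.1, 0.4, 0.3, 0.2])

def entropic_risk(xs, ps, alpha):
    # Direct definition: (1/alpha) * log E[exp(alpha * X)].
    return np.log(np.sum(ps * np.exp(alpha * xs))) / alpha

def oce(xs, ps, alpha, bs):
    # OCE with exponential utility u(t) = (exp(alpha*t) - 1)/alpha,
    # which is concave for alpha < 0; maximize over the budget grid bs.
    vals = bs + np.sum(
        ps[None, :] * (np.exp(alpha * (xs[None, :] - bs[:, None])) - 1.0),
        axis=1,
    ) / alpha
    return vals.max()

bs = np.linspace(-5.0, 10.0, 100_001)  # budget grid covering the optimum
print(entropic_risk(xs, ps, alpha))
print(oce(xs, ps, alpha, bs))  # agrees with the direct value up to grid error
```

The optimal budget is $b^* = \frac{1}{\alpha}\log\mathbb{E}[e^{\alpha X}]$ itself, which always lies between the smallest and largest values of $X$, so a grid spanning the support suffices.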
Cite
Text
Ghosh and Moharrami. "Online Learning in Risk Sensitive Constrained MDP." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Ghosh and Moharrami. "Online Learning in Risk Sensitive Constrained MDP." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/ghosh2025icml-online/)
BibTeX
@inproceedings{ghosh2025icml-online,
  title     = {{Online Learning in Risk Sensitive Constrained MDP}},
  author    = {Ghosh, Arnob and Moharrami, Mehrdad},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {19406--19425},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/ghosh2025icml-online/}
}