Regret Guarantees for Model-Based Reinforcement Learning with Long-Term Average Constraints
Abstract
We consider the problem of constrained Markov Decision Processes (CMDPs), where an agent interacts with an ergodic Markov Decision Process. At every interaction, the agent obtains a reward and incurs $K$ costs. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs below given thresholds. In this paper, we propose a posterior sampling based algorithm with which the agent can learn optimal policies to interact with the CMDP. We show that, under a slackness assumption characterized by $\kappa$, the optimization problem remains feasible for the sampled MDPs. Further, for an MDP with $S$ states, $A$ actions, and mixing time $T_M$, we prove that by following the proposed algorithm, the agent can bound the regret of not accumulating rewards from an optimal policy by $\tilde{O}(T_M S\sqrt{AT})$. We also show that the violation of each of the $K$ constraints is bounded by $\tilde{O}(T_M S\sqrt{AT})$. To the best of our knowledge, this is the first work to obtain $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints using a posterior sampling method.
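For reference, the long-term average objective and constraints described in the abstract can be written as below. The notation here (reward function $r$, cost functions $c_k$, thresholds $C_k$, and optimal average reward $\lambda^*$) is assumed for illustration and may differ from the paper's exact symbols and definitions.

$$\max_{\pi} \ \liminf_{T \to \infty} \frac{1}{T}\,\mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right] \quad \text{s.t.} \quad \limsup_{T \to \infty} \frac{1}{T}\,\mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} c_k(s_t, a_t)\right] \le C_k, \qquad k = 1, \dots, K.$$

Under this formulation, the regret and constraint violations bounded by $\tilde{O}(T_M S\sqrt{AT})$ are, in the standard form,

$$\mathrm{Reg}(T) = T\lambda^* - \sum_{t=1}^{T} r(s_t, a_t), \qquad \mathrm{Viol}_k(T) = \left(\sum_{t=1}^{T} c_k(s_t, a_t) - T C_k\right)_{+},$$

where $\lambda^*$ denotes the optimal long-run average reward of the constrained problem.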
Cite
Text
Agarwal et al. "Regret Guarantees for Model-Based Reinforcement Learning with Long-Term Average Constraints." Uncertainty in Artificial Intelligence, 2022.
Markdown
[Agarwal et al. "Regret Guarantees for Model-Based Reinforcement Learning with Long-Term Average Constraints." Uncertainty in Artificial Intelligence, 2022.](https://mlanthology.org/uai/2022/agarwal2022uai-regret/)
BibTeX
@inproceedings{agarwal2022uai-regret,
title = {{Regret Guarantees for Model-Based Reinforcement Learning with Long-Term Average Constraints}},
author = {Agarwal, Mridul and Bai, Qinbo and Aggarwal, Vaneet},
booktitle = {Uncertainty in Artificial Intelligence},
year = {2022},
pages = {22-31},
volume = {180},
url = {https://mlanthology.org/uai/2022/agarwal2022uai-regret/}
}