Robust Average-Reward Reinforcement Learning

Abstract

Robust Markov decision processes (MDPs) aim to find a policy that optimizes the worst-case performance over an uncertainty set of MDPs. Existing studies have mostly focused on robust MDPs under the discounted reward criterion, leaving those under the average-reward criterion largely unexplored. In this paper, we develop the first comprehensive and systematic study of robust average-reward MDPs, where the goal is to optimize the long-term average performance under the worst case. Our contributions are fourfold: (1) we prove the uniform convergence of the robust discounted value function to the robust average-reward function as the discount factor γ goes to 1; (2) we derive the robust average-reward Bellman equation, characterize the structure of its solution set, and prove the equivalence between solving the robust Bellman equation and finding the optimal robust policy; (3) we design robust dynamic programming algorithms and theoretically characterize their convergence to the optimal policy; and (4) we design two model-free algorithms utilizing the multi-level Monte-Carlo approach, and prove their asymptotic convergence.
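
As a quick orientation to contributions (1) and (2), the following is a schematic sketch in generic robust-MDP notation; the symbols g, h, V, σ and the uncertainty set 𝒫 are illustrative and not necessarily the paper's, and the precise definitions and assumptions (e.g., rectangularity of the uncertainty set) are given in the paper. The robust average reward of a policy π is its worst-case long-run average over the uncertainty set, and contribution (1) relates it to the robust discounted value function:

$$ g^{\pi}_{\mathcal{P}}(s) \;=\; \min_{\kappa \in \bigotimes_{t \ge 0} \mathcal{P}} \; \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\kappa,\pi}\!\left[\sum_{t=0}^{T-1} r(S_t, A_t) \,\middle|\, S_0 = s\right], \qquad \lim_{\gamma \to 1} (1-\gamma)\, V^{\pi}_{\mathcal{P},\gamma}(s) \;=\; g^{\pi}_{\mathcal{P}}(s), $$

with the convergence shown to be uniform (contribution 1). The robust average-reward Bellman equation for policy evaluation then takes a form such as

$$ g + h(s) \;=\; \sum_{a} \pi(a \mid s)\Big( r(s,a) + \sigma_{\mathcal{P}^{a}_{s}}(h) \Big), \qquad \sigma_{\mathcal{P}^{a}_{s}}(h) := \min_{p \in \mathcal{P}^{a}_{s}} p^{\top} h, $$

whose solutions (g, h) characterize the worst-case gain and relative value (bias) of π (contribution 2).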

Cite

Text

Wang et al. "Robust Average-Reward Reinforcement Learning." Journal of Artificial Intelligence Research, 2024. doi:10.1613/JAIR.1.15451

Markdown

[Wang et al. "Robust Average-Reward Reinforcement Learning." Journal of Artificial Intelligence Research, 2024.](https://mlanthology.org/jair/2024/wang2024jair-robust/) doi:10.1613/JAIR.1.15451

BibTeX

@article{wang2024jair-robust,
  title     = {{Robust Average-Reward Reinforcement Learning}},
  author    = {Wang, Yue and Velasquez, Alvaro and Atia, George K. and Prater-Bennette, Ashley and Zou, Shaofeng},
  journal   = {Journal of Artificial Intelligence Research},
  year      = {2024},
  pages     = {719--803},
  doi       = {10.1613/JAIR.1.15451},
  volume    = {80},
  url       = {https://mlanthology.org/jair/2024/wang2024jair-robust/}
}