Robust Average-Reward Reinforcement Learning
Abstract
Robust Markov decision processes (MDPs) aim to find a policy that optimizes the worst-case performance over an uncertainty set of MDPs. Existing studies have mostly focused on robust MDPs under the discounted-reward criterion, leaving those under the average-reward criterion largely unexplored. In this paper, we develop the first comprehensive and systematic study of robust average-reward MDPs, where the goal is to optimize the long-term average performance under the worst case. Our contributions are four-fold: (1) we prove the uniform convergence of the robust discounted value function to the robust average reward as the discount factor γ goes to 1; (2) we derive the robust average-reward Bellman equation, characterize the structure of its solution set, and prove the equivalence between solving the robust Bellman equation and finding the optimal robust policy; (3) we design robust dynamic programming algorithms and theoretically characterize their convergence to the optimal policy; and (4) we design two model-free algorithms utilizing the multi-level Monte Carlo approach and prove their asymptotic convergence.
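To make the robust dynamic programming idea concrete, below is a minimal sketch of robust relative value iteration for the average-reward setting, assuming a finite uncertainty set of transition kernels `P_set` (each of shape [S, A, S]), a reward table `r` of shape [S, A], and normalization at a reference state with index 0. These names and the finite uncertainty set are illustrative assumptions, not the paper's exact algorithm or uncertainty-set model.

```python
import numpy as np

def robust_relative_value_iteration(P_set, r, iters=1000, tol=1e-8):
    """Sketch of robust relative value iteration over a finite uncertainty set.

    P_set : list of transition kernels, each of shape [S, A, S] (assumed finite set)
    r     : reward table of shape [S, A]
    Returns an estimate of the worst-case average reward (gain), the relative
    value function, and a greedy policy.
    """
    S, A = r.shape
    V = np.zeros(S)
    gain, policy = 0.0, np.zeros(S, dtype=int)
    for _ in range(iters):
        # Worst-case expected next value over the uncertainty set, per (s, a).
        worst = np.min([P @ V for P in P_set], axis=0)   # shape [S, A]
        Q = r + worst
        TV = Q.max(axis=1)
        gain = TV[0] - V[0]          # gain estimate at the reference state
        policy = Q.argmax(axis=1)
        V_new = TV - TV[0]           # relative normalization keeps iterates bounded
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return gain, V, policy
```

The inner minimization is taken over the finite set of kernels for simplicity; for structured uncertainty sets (e.g., contamination or norm balls), this step would be replaced by the corresponding worst-case expectation.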
Cite
Text
Wang et al. "Robust Average-Reward Reinforcement Learning." Journal of Artificial Intelligence Research, 2024. doi:10.1613/JAIR.1.15451
Markdown
[Wang et al. "Robust Average-Reward Reinforcement Learning." Journal of Artificial Intelligence Research, 2024.](https://mlanthology.org/jair/2024/wang2024jair-robust/) doi:10.1613/JAIR.1.15451
BibTeX
@article{wang2024jair-robust,
title = {{Robust Average-Reward Reinforcement Learning}},
author = {Wang, Yue and Velasquez, Alvaro and Atia, George K. and Prater-Bennette, Ashley and Zou, Shaofeng},
journal = {Journal of Artificial Intelligence Research},
year = {2024},
pages = {719-803},
doi = {10.1613/JAIR.1.15451},
volume = {80},
url = {https://mlanthology.org/jair/2024/wang2024jair-robust/}
}