Optimal Sample Complexity for Average Reward Markov Decision Processes
Abstract
We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin \& Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.
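As a rough illustration of how the optimal rate scales (and not a description of the paper's estimator), the sketch below computes the total number of generative-model samples implied by the $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$ bound, ignoring logarithmic factors and absolute constants. The function name and the example MDP sizes are placeholders chosen for illustration only.

```python
import math

def sample_budget(num_states: int, num_actions: int, t_mix: float, eps: float) -> int:
    """Back-of-the-envelope sample count implied by the O(|S||A| t_mix / eps^2) rate.

    This only illustrates how the bound scales; it omits logarithmic factors
    and constants, and it is not the estimator proposed in the paper.
    """
    per_state_action = t_mix / eps**2          # samples drawn per (s, a) pair
    total = num_states * num_actions * per_state_action
    return math.ceil(total)

# Example: a small MDP with |S| = 100, |A| = 10, t_mix = 50, eps = 0.1
# needs on the order of 100 * 10 * 50 / 0.01 = 5,000,000 generative-model samples.
print(sample_budget(100, 10, 50.0, 0.1))
```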
Cite
Text
Wang et al. "Optimal Sample Complexity for Average Reward Markov Decision Processes." International Conference on Learning Representations, 2024.
Markdown
[Wang et al. "Optimal Sample Complexity for Average Reward Markov Decision Processes." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/wang2024iclr-optimal/)
BibTeX
@inproceedings{wang2024iclr-optimal,
  title     = {{Optimal Sample Complexity for Average Reward Markov Decision Processes}},
  author    = {Wang, Shengbo and Blanchet, Jose and Glynn, Peter},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/wang2024iclr-optimal/}
}