Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs
Abstract
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) in the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty} principle and a Bernstein-type exploration bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
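The abstract describes UCBVI-$\gamma$ only at a high level: a model-based, optimistic algorithm with a Bernstein-type (variance-dependent) bonus. The Python sketch below illustrates that general recipe for a discounted tabular MDP; the function name `optimistic_value_iteration`, the bonus constants, and the exact bonus form are illustrative assumptions, not the paper's precise specification.

```python
# A minimal, illustrative sketch of optimistic value iteration for a discounted
# tabular MDP with a Bernstein-style bonus. This is NOT the exact UCBVI-gamma
# algorithm from the paper; constants and helper names are assumptions chosen
# for readability.
import numpy as np

def optimistic_value_iteration(counts, rewards, gamma, t, n_iters=200, delta=0.05):
    """One optimistic planning step from empirical counts.

    counts:  (S, A, S) array of visit counts N(s, a, s').
    rewards: (S, A) array of mean rewards in [0, 1] (assumed known here).
    Returns an optimistic action-value function Q of shape (S, A).
    """
    S, A, _ = counts.shape
    n_sa = np.maximum(counts.sum(axis=2), 1)        # N(s, a), at least 1
    p_hat = counts / n_sa[:, :, None]               # empirical transition model
    v_max = 1.0 / (1.0 - gamma)                     # upper bound on values
    log_term = np.log(S * A * max(t, 2) / delta)    # confidence-level term

    q = np.zeros((S, A))
    v = np.zeros(S)
    for _ in range(n_iters):
        ev = p_hat @ v                              # E_hat[V(s') | s, a]
        var = np.maximum(p_hat @ (v ** 2) - ev ** 2, 0.0)  # Var_hat[V(s')]
        # Bernstein-style bonus: variance-dependent term plus a lower-order term.
        bonus = np.sqrt(2.0 * var * log_term / n_sa) + v_max * log_term / n_sa
        q = np.minimum(rewards + bonus + gamma * ev, v_max)
        v = q.max(axis=1)
    return q
```

The variance-dependent first term of the bonus, rather than a Hoeffding-style bound of order $1/(1-\gamma)$, is what the paper's analysis leverages to reach the $(1-\gamma)^{-1.5}$ dependence in the regret.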
Cite
Text
He et al. "Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs." Neural Information Processing Systems, 2021.
Markdown
[He et al. "Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/he2021neurips-nearly/)
BibTeX
@inproceedings{he2021neurips-nearly,
title = {{Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs}},
author = {He, Jiafan and Zhou, Dongruo and Gu, Quanquan},
booktitle = {Neural Information Processing Systems},
year = {2021},
url = {https://mlanthology.org/neurips/2021/he2021neurips-nearly/}
}