Q-Learning with Logarithmic Regret
Abstract
This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning when there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning algorithm studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information-theoretic lower bound in terms of $S, A, T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
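For illustration, below is a minimal sketch of episodic Q-learning with UCB-Hoeffding exploration bonuses in the spirit of the [Jin et al. 2018] algorithm whose regret is analyzed in this paper. It is not the paper's exact algorithm or constants; the environment interface (`env.reset` / `env.step`), the bonus constant `c`, and the confidence parameter `delta` are assumptions made for the example.

```python
import numpy as np

def optimistic_q_learning(env, S, A, H, num_episodes, c=1.0, delta=0.05):
    """Sketch of episodic Q-learning with UCB-Hoeffding-style bonuses.

    Assumes a hypothetical environment with env.reset() -> state and
    env.step(action) -> (next_state, reward); states/actions are integers.
    """
    T = num_episodes * H
    Q = np.full((H, S, A), float(H))      # optimistic initialization at H
    V = np.zeros((H + 1, S))              # V[H] = 0 at the terminal step
    N = np.zeros((H, S, A), dtype=int)    # per-step visit counts

    for _ in range(num_episodes):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))   # act greedily w.r.t. optimistic Q
            s_next, r = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)     # learning rate schedule from Jin et al. 2018
            bonus = c * np.sqrt(H**3 * np.log(S * A * T / delta) / t)
            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(H, Q[h, s].max())   # keep the value estimate bounded by H
            s = s_next
    return Q
```

The optimistic initialization and the additive bonus keep the Q-estimates upper bounds on the optimal Q-function with high probability, which is what drives exploration in this family of algorithms; the gap-dependent logarithmic regret bound is a property of the analysis rather than of any change to the update rule itself.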
Cite
Text
Yang et al. "Q-Learning with Logarithmic Regret." Artificial Intelligence and Statistics, 2021.
Markdown
[Yang et al. "Q-Learning with Logarithmic Regret." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/yang2021aistats-qlearning/)
BibTeX
@inproceedings{yang2021aistats-qlearning,
  title = {{Q-Learning with Logarithmic Regret}},
  author = {Yang, Kunhe and Yang, Lin and Du, Simon},
  booktitle = {Artificial Intelligence and Statistics},
  year = {2021},
  pages = {1576-1584},
  volume = {130},
  url = {https://mlanthology.org/aistats/2021/yang2021aistats-qlearning/}
}