Gap-Dependent Bounds for Two-Player Markov Games

Abstract

Q-learning, one of the most popular methods in reinforcement learning, has received increasing attention, and there has recently been a growing body of theoretical work on regret bounds for Q-learning-style algorithms in various settings. In this paper, we analyze the cumulative regret of the Nash Q-learning algorithm on 2-player turn-based stochastic games (2-TBSG) and establish the first gap-dependent logarithmic upper bounds in the episodic tabular setting. These bounds match the theoretical lower bound up to a logarithmic factor. Furthermore, we extend our analysis to the infinite-horizon discounted setting and prove a similar gap-dependent logarithmic regret bound. Finally, under the linear MDP assumption, we obtain another logarithmic regret bound for 2-TBSG, in both the centralized and independent settings.
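To make the setting concrete, below is a minimal sketch of optimistic tabular Q-learning for an episodic 2-TBSG, where at each state only one player acts: the max-player greedily maximizes and the min-player greedily minimizes a shared Q estimate, so the "Nash" value of a state reduces to a max or a min. The environment interface (`reset`, `step`, `player_to_move`), the step size, and the optimistic initialization are illustrative assumptions; this is not the paper's exact algorithm or exploration-bonus design.

```python
import numpy as np

def run_turn_based_q_learning(env, H, S, A, n_episodes):
    """Sketch of optimistic Q-learning for an episodic 2-player turn-based
    zero-sum game. `env` is a hypothetical environment with methods
    reset() -> state, step(a) -> (next_state, reward, done), and
    player_to_move(state) -> 0 (max-player) or 1 (min-player)."""
    Q = np.full((H, S, A), float(H))       # optimistic init, assuming rewards in [0, 1]
    V = np.zeros((H + 1, S))               # value-to-go estimates; V[H, .] = 0
    counts = np.zeros((H, S, A), dtype=int)

    for _ in range(n_episodes):
        s = env.reset()
        for h in range(H):
            # Only one player moves at each state in a turn-based game.
            if env.player_to_move(s) == 0:
                a = int(np.argmax(Q[h, s]))
            else:
                a = int(np.argmin(Q[h, s]))

            s_next, r, done = env.step(a)
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha = (H + 1) / (H + t)       # standard Q-learning step size choice

            # The next-state value is itself a max or a min, depending on
            # whose turn it is there (the turn-based analogue of a Nash value).
            if h + 1 < H:
                if env.player_to_move(s_next) == 0:
                    V[h + 1, s_next] = np.max(Q[h + 1, s_next])
                else:
                    V[h + 1, s_next] = np.min(Q[h + 1, s_next])

            target = r + V[h + 1, s_next]
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            s = s_next
            if done:
                break
    return Q
```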

Cite

Text

Dou et al. "Gap-Dependent Bounds for Two-Player Markov Games." Artificial Intelligence and Statistics, 2022.

Markdown

[Dou et al. "Gap-Dependent Bounds for Two-Player Markov Games." Artificial Intelligence and Statistics, 2022.](https://mlanthology.org/aistats/2022/dou2022aistats-gapdependent/)

BibTeX

@inproceedings{dou2022aistats-gapdependent,
  title     = {{Gap-Dependent Bounds for Two-Player Markov Games}},
  author    = {Dou, Zehao and Yang, Zhuoran and Wang, Zhaoran and Du, Simon},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2022},
  pages     = {432--455},
  volume    = {151},
  url       = {https://mlanthology.org/aistats/2022/dou2022aistats-gapdependent/}
}