Near-Optimal Reinforcement Learning with Self-Play
Abstract
This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions, and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires \widetilde{O}(S^2 AB) steps of game playing, when highlighting only the dependency on (S, A, B). In contrast, the best existing lower bound scales as \Omega(S(A+B)), leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity \widetilde{O}(SAB), and a new Nash V-learning algorithm with sample complexity \widetilde{O}(S(A+B)). The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games, a learning objective different from finding the Nash equilibrium.
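To make the flavor of the optimistic Nash Q-learning result concrete, below is a minimal Python sketch, not the paper's exact pseudocode. It shows (i) a single optimistic update of one Q-table entry for a visited (step, state, action-pair), using the (H+1)/(H+t) stepsize and a square-root exploration bonus in the spirit of the optimistic Q-learning template the paper builds on, and (ii) an LP-based solver for the value of the resulting zero-sum matrix game, which stands in for the equilibrium subroutine used to set the value estimate. The function names (`matrix_game_value`, `nash_q_update`), the constant `c_bonus`, and the omitted logarithmic factor in the bonus are illustrative assumptions, and the paper's actual algorithm uses a coarse correlated equilibrium subroutine and maintains both upper and lower estimates.

```python
# Illustrative sketch only; not the authors' implementation.
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(Q):
    """Value and max-player strategy of the zero-sum matrix game Q (A x B).

    Solves max_x min_b sum_a x[a] * Q[a, b] as a linear program.
    """
    A, B = Q.shape
    # Decision variables z = (x_1, ..., x_A, v); maximize v <=> minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # For every min-player action b:  v - sum_a x[a] * Q[a, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # Mixed strategy x must sum to one.
    A_eq = np.ones((1, A + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:A]


def nash_q_update(q_sab, t, r, v_next, H, c_bonus=1.0):
    """One optimistic update of Q_h(s, a, b) for the action pair actually played.

    q_sab  : current optimistic Q estimate for this (step, state, a, b)
    t      : visit count of this entry, including the current visit
    r      : observed reward for the played action pair (a, b)
    v_next : current optimistic value estimate of the next state
    """
    alpha = (H + 1.0) / (H + t)            # stepsize from the optimistic Q-learning template
    bonus = c_bonus * np.sqrt(H ** 3 / t)  # placeholder UCB-style bonus (log factor omitted)
    return (1 - alpha) * q_sab + alpha * (r + v_next + bonus)
```

A full algorithm would loop this update over episodes and steps, recompute the state's value and policy from the updated Q matrix (e.g., `matrix_game_value(Q[h, s])`), and extract the output policy from the stored per-visit strategies; see the paper for the exact procedure, constants, and the Nash V-learning variant that removes the A*B dependence.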
Cite
Text
Bai et al. "Near-Optimal Reinforcement Learning with Self-Play." Neural Information Processing Systems, 2020.
Markdown
[Bai et al. "Near-Optimal Reinforcement Learning with Self-Play." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/bai2020neurips-nearoptimal/)
BibTeX
@inproceedings{bai2020neurips-nearoptimal,
  title = {{Near-Optimal Reinforcement Learning with Self-Play}},
  author = {Bai, Yu and Jin, Chi and Yu, Tiancheng},
  booktitle = {Neural Information Processing Systems},
  year = {2020},
  url = {https://mlanthology.org/neurips/2020/bai2020neurips-nearoptimal/}
}