Learning in Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

Abstract

We study the problem of learning a Nash equilibrium (NE) in an extensive game with imperfect information (EGII) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular EGII under the \textit{perfect-recall} assumption, where the only feedback is realizations of the game (bandit feedback). In particular, the \textit{dynamics of the EGII are not known}: we can only access it by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order $1/\sqrt{T}$, where $T$ is the number of played games. Moreover, IXOMD is computationally efficient as it needs to perform the updates only along the sampled trajectory.
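
The abstract combines two ingredients: implicit-exploration (IX) loss estimates built from bandit feedback, and an online mirror descent update performed only along the sampled trajectory. The sketch below illustrates these two ingredients at a single information set, under the simplifying assumption of one local decision point; the function name, parameters, and this simplification are illustrative only and are not the paper's IXOMD algorithm, which operates over the full tree of information sets of the player.

# Hypothetical sketch: an IX-biased loss estimate followed by an entropic
# online mirror descent (multiplicative-weights) update at one information set.
import numpy as np

def ix_omd_step(policy, sampled_action, observed_loss, eta, gamma):
    """One local update from a single sampled trajectory (illustrative).

    policy         : current probabilities over actions at this information set
    sampled_action : index of the action actually played
    observed_loss  : loss in [0, 1] received for the played action
    eta            : learning rate
    gamma          : implicit-exploration parameter; biasing the estimate
                     downward controls the variance of importance weighting
    """
    # IX estimator: only the sampled action receives a non-zero loss estimate.
    loss_hat = np.zeros_like(policy)
    loss_hat[sampled_action] = observed_loss / (policy[sampled_action] + gamma)

    # OMD step with a negative-entropy regularizer = multiplicative weights.
    new_policy = policy * np.exp(-eta * loss_hat)
    return new_policy / new_policy.sum()

# Example: uniform policy over 3 actions; action 1 was played and incurred loss 0.5.
policy = np.ones(3) / 3
policy = ix_omd_step(policy, sampled_action=1, observed_loss=0.5, eta=0.1, gamma=0.05)

Because the loss estimate is zero off the sampled action, only the quantities along the played trajectory need to be touched, which is what makes an update per played game computationally cheap.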

Cite

Text

Kozuno et al. "Learning in Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall." Neural Information Processing Systems, 2021.

Markdown

[Kozuno et al. "Learning in Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/kozuno2021neurips-learning/)

BibTeX

@inproceedings{kozuno2021neurips-learning,
  title     = {{Learning in Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall}},
  author    = {Kozuno, Tadashi and Ménard, Pierre and Munos, Rémi and Valko, Michal},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/kozuno2021neurips-learning/}
}