A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence

Abstract

Policy gradient methods have become a staple of any single-agent reinforcement learning toolbox, due to their combination of desirable properties: iterate convergence, efficient use of stochastic trajectory feedback, and theoretically sound avoidance of importance-sampling corrections. In multi-agent imperfect-information settings (extensive-form games), however, it remains unknown whether the same desiderata can be achieved while retaining theoretical guarantees. Instead, sound methods for extensive-form games rely on approximating *counterfactual* values (as opposed to Q-values), which are incompatible with policy gradient methodologies. In this paper, we investigate whether policy gradient can be safely used in two-player zero-sum imperfect-information extensive-form games (EFGs). We establish positive results, showing for the first time that a policy gradient method leads to provable best-iterate convergence to a regularized Nash equilibrium in self-play.
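
To illustrate the kind of result the abstract describes, the snippet below is a minimal sketch (not the paper's algorithm) of entropy-regularized softmax policy-gradient self-play on a small zero-sum matrix game. The rock-paper-scissors payoff matrix, the temperature tau, the step size eta, and the iteration count are all assumptions chosen for this toy example; with tau > 0 the two policies settle near the game's regularized (quantal-response) equilibrium, which for this symmetric game is the uniform policy.

# Illustrative sketch only: entropy-regularized policy-gradient self-play
# on a zero-sum matrix game (not the paper's extensive-form algorithm).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Row player's payoff matrix for rock-paper-scissors; the column player receives -A.
A = np.array([[ 0.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])

tau = 0.1              # entropy-regularization temperature (assumed for the toy example)
eta = 0.5              # step size (assumed)
theta_x = np.zeros(3)  # row-player logits
theta_y = np.zeros(3)  # column-player logits

for _ in range(5000):
    x, y = softmax(theta_x), softmax(theta_y)
    # Jacobian of the softmax map, used by the chain rule d f(x(theta)) / d theta.
    Jx = np.diag(x) - np.outer(x, x)
    Jy = np.diag(y) - np.outer(y, y)
    # Row player ascends   x^T A y + tau * H(x);
    # column player ascends -x^T A y + tau * H(y),  where H is Shannon entropy.
    grad_x = Jx @ (A @ y - tau * (np.log(x) + 1.0))
    grad_y = Jy @ (-A.T @ x - tau * (np.log(y) + 1.0))
    theta_x += eta * grad_x
    theta_y += eta * grad_y

print("row policy:   ", np.round(softmax(theta_x), 3))
print("column policy:", np.round(softmax(theta_y), 3))
# With tau > 0 both policies approach the regularized equilibrium (uniform here);
# with tau = 0 the same simultaneous updates do not converge and instead cycle.

The contrast in the last comment is the point of the sketch: the entropy term is what turns a cycling self-play dynamic into one with convergent iterates, mirroring the regularized-equilibrium target in the abstract.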

Cite

Text

Liu et al. "A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence." International Conference on Learning Representations, 2025.

Markdown

[Liu et al. "A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/liu2025iclr-policygradient/)

BibTeX

@inproceedings{liu2025iclr-policygradient,
  title     = {{A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence}},
  author    = {Liu, Mingyang and Farina, Gabriele and Ozdaglar, Asuman E.},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/liu2025iclr-policygradient/}
}