Independent Policy Gradient Methods for Competitive Reinforcement Learning

Abstract

We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent learning in competitive RL, as prior work has largely focused on centralized/coordinated procedures for equilibrium computation.

Cite

Text

Daskalakis et al. "Independent Policy Gradient Methods for Competitive Reinforcement Learning." Neural Information Processing Systems, 2020.

Markdown

[Daskalakis et al. "Independent Policy Gradient Methods for Competitive Reinforcement Learning." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/daskalakis2020neurips-independent/)

BibTeX

@inproceedings{daskalakis2020neurips-independent,
  title     = {{Independent Policy Gradient Methods for Competitive Reinforcement Learning}},
  author    = {Daskalakis, Constantinos and Foster, Dylan J and Golowich, Noah},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/daskalakis2020neurips-independent/}
}