Learning with Bandit Feedback in Potential Games

Abstract

This paper examines the equilibrium convergence properties of no-regret learning with exponential weights in potential games. To establish convergence with minimal information requirements on the players' side, we focus on two frameworks: the semi-bandit case (where players have access to a noisy estimate of their payoff vectors, including strategies they did not play), and the bandit case (where players are only able to observe their in-game, realized payoffs). In the semi-bandit case, we show that the induced sequence of play converges almost surely to a Nash equilibrium at a quasi-exponential rate. In the bandit case, the same result holds for approximate Nash equilibria, with the approximation level controlled by a constant exploration factor that keeps action choice probabilities bounded away from zero. In particular, if the algorithm is run with a suitably decreasing exploration factor, the sequence of play converges to a bona fide Nash equilibrium with probability 1.
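The bandit procedure summarized above combines an exponential-weights (logit) choice rule with importance-weighted payoff estimates and a uniform exploration mixture. The Python sketch below is a minimal single-player illustration of that scheme under stated assumptions; the function name, payoff oracle, step size, and exploration constant are placeholders for illustration and not the paper's exact algorithm or parameters.

import numpy as np

def exp_weights_bandit(payoff_oracle, n_actions, n_rounds,
                       step_size=0.1, exploration=0.05, rng=None):
    # Illustrative exponential-weights loop with bandit feedback.
    # step_size and exploration are placeholder constants, not the
    # schedules analyzed in the paper.
    rng = np.random.default_rng() if rng is None else rng
    scores = np.zeros(n_actions)            # cumulative payoff estimates
    probs = np.full(n_actions, 1.0 / n_actions)
    for t in range(n_rounds):
        # Logit choice map, mixed with the uniform distribution so that
        # every action keeps probability at least exploration / n_actions.
        logits = step_size * scores
        weights = np.exp(logits - logits.max())
        probs = (1 - exploration) * weights / weights.sum() + exploration / n_actions
        # Play one action and observe only its realized payoff.
        action = rng.choice(n_actions, p=probs)
        payoff = payoff_oracle(action, t)
        # Importance-weighted estimate of the full payoff vector:
        # unbiased because the played action is reweighted by 1 / probs[action].
        estimate = np.zeros(n_actions)
        estimate[action] = payoff / probs[action]
        scores += estimate
    return probs

# Toy usage: two actions whose noisy payoffs favour action 1.
if __name__ == "__main__":
    oracle = lambda action, t: (0.3 if action == 0 else 0.7) + 0.1 * np.random.randn()
    print(exp_weights_bandit(oracle, n_actions=2, n_rounds=5000))

In this toy run the choice probabilities concentrate on the higher-payoff action, while the exploration term keeps both probabilities strictly positive, mirroring the role of the exploration factor described in the abstract.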

Cite

Text

Heliou et al. "Learning with Bandit Feedback in Potential Games." Neural Information Processing Systems, 2017.

Markdown

[Heliou et al. "Learning with Bandit Feedback in Potential Games." Neural Information Processing Systems, 2017.](https://mlanthology.org/neurips/2017/heliou2017neurips-learning/)

BibTeX

@inproceedings{heliou2017neurips-learning,
  title     = {{Learning with Bandit Feedback in Potential Games}},
  author    = {Heliou, Amélie and Cohen, Johanne and Mertikopoulos, Panayotis},
  booktitle = {Neural Information Processing Systems},
  year      = {2017},
  pages     = {6369--6378},
  url       = {https://mlanthology.org/neurips/2017/heliou2017neurips-learning/}
}