Maximin Action Identification: A New Bandit Framework for Games

Abstract

We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists in identifying the best action in a game when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower- and upper-confidence bounds; and Maximin-Racing, which operates by successively eliminating the sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically. We sketch a lower bound analysis, and possible connections to an optimal algorithm.
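The maximin objective can be stated concretely: for a game with mean rewards μ(a, b) when the player picks action a and the opponent picks b, the target is a* = argmax_a min_b μ(a, b). The sketch below illustrates a LUCB-style identification loop in this spirit. It is a minimal illustration under stated assumptions, not the paper's exact Maximin-LUCB: the Bernoulli outcomes, Hoeffding confidence radius, exploration rate, and leader/challenger sampling rule are plausible choices rather than the ones analyzed by the authors.

```python
import math
import random

def maximin_lucb(mu, delta=0.1, max_samples=200_000, seed=0):
    """Hedged LUCB-style sketch for maximin action identification.

    mu[a][b] is the (unknown to the learner) mean of a Bernoulli outcome
    when the player picks row a and the opponent picks column b.  The goal
    is to identify argmax_a min_b mu[a][b] with confidence 1 - delta.
    """
    rng = random.Random(seed)
    A, B = len(mu), len(mu[0])
    counts = [[0] * B for _ in range(A)]
    sums = [[0.0] * B for _ in range(A)]
    t = 0

    def mean(a, b):
        return sums[a][b] / counts[a][b] if counts[a][b] else 0.5

    def radius(a, b):
        # Hoeffding-style radius; this exploration rate is an assumption,
        # not the paper's threshold function.
        n = counts[a][b]
        if n == 0:
            return float("inf")
        return math.sqrt(math.log(4 * A * B * max(t, 2) ** 2 / delta) / (2 * n))

    def sample(a, b):
        nonlocal t
        t += 1
        counts[a][b] += 1
        sums[a][b] += 1.0 if rng.random() < mu[a][b] else 0.0

    # Initialize with one sample per action pair.
    for a in range(A):
        for b in range(B):
            sample(a, b)

    while t < max_samples:
        # Confidence bounds on each action's maximin value min_b mu[a][b].
        low = [min(mean(a, b) - radius(a, b) for b in range(B)) for a in range(A)]
        up = [min(mean(a, b) + radius(a, b) for b in range(B)) for a in range(A)]
        leader = max(range(A), key=lambda a: low[a])
        challenger = max((a for a in range(A) if a != leader), key=lambda a: up[a])
        if low[leader] >= up[challenger]:
            return leader, t  # leader's maximin value provably dominates
        # Sample the opponent columns realizing each critical bound.
        b_lead = min(range(B), key=lambda b: mean(leader, b) - radius(leader, b))
        b_chal = min(range(B), key=lambda b: mean(challenger, b) + radius(challenger, b))
        sample(leader, b_lead)
        sample(challenger, b_chal)

    # Budget exhausted: fall back to the empirical maximin action.
    return max(range(A), key=lambda a: min(mean(a, b) for b in range(B))), t
```

For example, on the 2x2 matrix mu = [[0.45, 0.50], [0.55, 0.35]] the maximin action is 0 (value 0.45 versus 0.35 for action 1), and maximin_lucb(mu, delta=0.1) returns it together with the number of samples drawn.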

Cite

Text

Garivier et al. "Maximin Action Identification: A New Bandit Framework for Games." Annual Conference on Computational Learning Theory, 2016.

Markdown

[Garivier et al. "Maximin Action Identification: A New Bandit Framework for Games." Annual Conference on Computational Learning Theory, 2016.](https://mlanthology.org/colt/2016/garivier2016colt-maximin/)

BibTeX

@inproceedings{garivier2016colt-maximin,
  title     = {{Maximin Action Identification: A New Bandit Framework for Games}},
  author    = {Garivier, Aurélien and Kaufmann, Emilie and Koolen, Wouter M.},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2016},
  pages     = {1028--1050},
  url       = {https://mlanthology.org/colt/2016/garivier2016colt-maximin/}
}