Maximin Action Identification: A New Bandit Framework for Games
Abstract
We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists in identifying the best action in a game, when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower- and upper-confidence bounds; and Maximin-Racing, which operates by successively eliminating the sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically. We sketch a lower bound analysis, and possible connections to an optimal algorithm.
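To make the setting concrete: the player faces a matrix of unknown mean rewards μ(i, j), the value of a row action i is min_j μ(i, j), and the goal is to identify argmax_i min_j μ(i, j) with probability at least 1 − δ. The Python sketch below is a hedged illustration of the Maximin-LUCB idea under these assumptions; the function name `maximin_lucb`, the Hoeffding-style confidence radius, and the specific sampling and stopping rules are illustrative choices, not the paper's exact construction.

```python
import math
import random


def maximin_lucb(sample, n_rows, n_cols, delta, max_steps=100_000):
    """LUCB-style loop for identifying argmax_i min_j mu[i][j].

    `sample(i, j)` returns one outcome in [0, 1] for the action pair
    (i, j). Assumes at least two row actions. Illustrative sketch,
    not the paper's calibrated algorithm.
    """
    counts = [[0] * n_cols for _ in range(n_rows)]
    sums = [[0.0] * n_cols for _ in range(n_rows)]

    def pull(i, j):
        sums[i][j] += sample(i, j)
        counts[i][j] += 1

    # Initialise every pair once so all means and radii are defined.
    for i in range(n_rows):
        for j in range(n_cols):
            pull(i, j)

    for t in range(n_rows * n_cols, max_steps):
        def mean(i, j):
            return sums[i][j] / counts[i][j]

        # Illustrative Hoeffding-style radius; the paper calibrates
        # its exploration rate more carefully.
        def radius(i, j):
            return math.sqrt(
                math.log(4 * n_rows * n_cols * (t + 1) ** 2 / delta)
                / (2 * counts[i][j])
            )

        # The value of row i is min_j mu[i][j], so confidence bounds
        # on the value are minima of per-pair confidence bounds.
        lcb = [min(mean(i, j) - radius(i, j) for j in range(n_cols))
               for i in range(n_rows)]
        ucb = [min(mean(i, j) + radius(i, j) for j in range(n_cols))
               for i in range(n_rows)]

        leader = max(range(n_rows), key=lcb.__getitem__)
        challenger = max((i for i in range(n_rows) if i != leader),
                         key=ucb.__getitem__)

        # Stop once the leader's pessimistic value beats every other
        # row's optimistic value.
        if lcb[leader] >= ucb[challenger]:
            return leader

        # Sample the column that currently pins down each critical bound.
        pull(leader, min(range(n_cols),
                         key=lambda j: mean(leader, j) - radius(leader, j)))
        pull(challenger, min(range(n_cols),
                             key=lambda j: mean(challenger, j) + radius(challenger, j)))

    # Budget exhausted: fall back on the empirical maximin row.
    return max(range(n_rows),
               key=lambda i: min(sums[i][j] / counts[i][j]
                                 for j in range(n_cols)))


if __name__ == "__main__":
    # Toy 2x2 Bernoulli game: row 0 has maximin value 0.5 > 0.4.
    mu = [[0.7, 0.5], [0.8, 0.4]]
    draw = lambda i, j: float(random.random() < mu[i][j])
    print(maximin_lucb(draw, 2, 2, delta=0.05))
```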
Cite
Text
Garivier et al. "Maximin Action Identification: A New Bandit Framework for Games." Annual Conference on Computational Learning Theory, 2016.
Markdown
[Garivier et al. "Maximin Action Identification: A New Bandit Framework for Games." Annual Conference on Computational Learning Theory, 2016.](https://mlanthology.org/colt/2016/garivier2016colt-maximin/)
BibTeX
@inproceedings{garivier2016colt-maximin,
title = {{Maximin Action Identification: A New Bandit Framework for Games}},
author = {Garivier, Aurélien and Kaufmann, Emilie and Koolen, Wouter M.},
booktitle = {Annual Conference on Computational Learning Theory},
year = {2016},
pages = {1028--1050},
url = {https://mlanthology.org/colt/2016/garivier2016colt-maximin/}
}