Minimax Policies for Adversarial and Stochastic Bandits

Abstract

We fill in a long open gap in the characterization of the minimax rate for the multi-armed bandit problem. Concretely, we remove an extraneous logarithmic factor in the previously known upper bound and propose a new family of randomized algorithms based on an implicit normalization, as well as a new analysis. We also consider the stochastic case, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.
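To make the abstract's "appropriate modification of UCB1" concrete, below is a minimal sketch of such a policy. The index form used here, empirical mean plus a bonus whose logarithm is truncated at zero, is a reconstruction of the MOSS-style index associated with this paper; the function names and the small Bernoulli simulation are purely illustrative, not code from the authors.

```python
import math
import random

def moss_index(mean, s, n, K):
    """UCB1-style index with a truncated logarithm (MOSS-style sketch).

    mean: empirical mean reward of the arm
    s:    number of times the arm has been pulled
    n:    total horizon (number of plays)
    K:    number of arms
    """
    # The max(..., 0) truncation is what removes the extra log factor
    # compared to plain UCB1's log(t) bonus.
    return mean + math.sqrt(max(math.log(n / (K * s)), 0.0) / s)

def run_bandit(arm_means, n, seed=0):
    """Play a Bernoulli bandit for n rounds with the index above."""
    rng = random.Random(seed)
    K = len(arm_means)
    counts = [0] * K      # pulls per arm
    sums = [0.0] * K      # cumulative reward per arm
    for t in range(n):
        if t < K:
            i = t  # pull each arm once to initialize its empirical mean
        else:
            i = max(range(K),
                    key=lambda a: moss_index(sums[a] / counts[a],
                                             counts[a], n, K))
        reward = 1.0 if rng.random() < arm_means[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

counts = run_bandit([0.9, 0.1], n=2000)
```

Note the design point: once an arm has been pulled more than n/K times, its exploration bonus vanishes entirely, which is how the policy attains the distribution-free rate without the logarithmic overhead of UCB1.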

Cite

Text

Audibert and Bubeck. "Minimax Policies for Adversarial and Stochastic Bandits." Annual Conference on Computational Learning Theory, 2009.

Markdown

[Audibert and Bubeck. "Minimax Policies for Adversarial and Stochastic Bandits." Annual Conference on Computational Learning Theory, 2009.](https://mlanthology.org/colt/2009/audibert2009colt-minimax/)

BibTeX

@inproceedings{audibert2009colt-minimax,
  title     = {{Minimax Policies for Adversarial and Stochastic Bandits}},
  author    = {Audibert, Jean-Yves and Bubeck, Sébastien},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2009},
  url       = {https://mlanthology.org/colt/2009/audibert2009colt-minimax/}
}