Best Arm Identification in Multi-Armed Bandits

Abstract

We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and also independent of the scaling of the rewards. As a by-product of our analysis, we show that identifying the best arm (when it is unique) requires a number of samples of order (up to a log(K) factor) $\sum_i 1/\Delta_i^2$, where the sum is over the suboptimal arms and $\Delta_i$ represents the difference between the mean reward of the best arm and that of arm $i$. This generalizes the well-known fact that one needs of order $1/\Delta^2$ samples to differentiate the means of two distributions with gap $\Delta$.
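The successive rejects idea summarized in the abstract can be sketched in code: split the budget of n pulls into K-1 phases, sample every surviving arm equally within each phase, and discard the arm with the lowest empirical mean at the end of each phase. This is an unofficial sketch (the phase lengths follow the paper's $n_k$ schedule with normalizer $\overline{\log}(K) = 1/2 + \sum_{i=2}^{K} 1/i$); the `arms`-as-callables interface is an assumption for illustration, not the authors' code.

```python
import math

def successive_rejects(arms, budget):
    """Sketch of the Successive Rejects algorithm.

    arms   -- list of K callables, each returning one stochastic reward
    budget -- total number of pulls n
    Returns the index of the recommended (empirically best) arm.
    """
    K = len(arms)
    # Normalizer from the paper: log-bar(K) = 1/2 + sum_{i=2}^K 1/i
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    totals = [0.0] * K   # cumulative reward per arm
    counts = [0] * K     # number of pulls per arm
    n_prev = 0
    for k in range(1, K):
        # Cumulative pulls per surviving arm allowed by the end of phase k
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        pulls = max(n_k - n_prev, 0)  # fresh pulls this phase
        for i in active:
            for _ in range(pulls):
                totals[i] += arms[i]()
                counts[i] += 1
        # Reject the active arm with the lowest empirical mean
        worst = min(
            active,
            key=lambda i: totals[i] / counts[i] if counts[i] else float("-inf"),
        )
        active.remove(worst)
        n_prev = n_k
    return active[0]
```

Note the parameter-free character stressed in the abstract: apart from the budget n, the procedure needs no tuning, and rescaling all rewards by a common factor leaves the elimination order unchanged.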

Cite

Text

Audibert et al. "Best Arm Identification in Multi-Armed Bandits." Annual Conference on Computational Learning Theory, 2010.

Markdown

[Audibert et al. "Best Arm Identification in Multi-Armed Bandits." Annual Conference on Computational Learning Theory, 2010.](https://mlanthology.org/colt/2010/audibert2010colt-best/)

BibTeX

@inproceedings{audibert2010colt-best,
  title     = {{Best Arm Identification in Multi-Armed Bandits}},
  author    = {Audibert, Jean-Yves and Bubeck, Sébastien and Munos, Rémi},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2010},
  pages     = {41--53},
  url       = {https://mlanthology.org/colt/2010/audibert2010colt-best/}
}