Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis

Abstract

The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies; such experiments have, until now, been lacking in the literature for the Bernoulli case.
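
For concreteness, below is a minimal sketch of the policy the paper analyses: Thompson Sampling on a Bernoulli bandit with independent uniform Beta(1, 1) priors over the arm means. The function name, the NumPy-based simulation, and the regret bookkeeping are illustrative choices, not code from the paper.

import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, rng=None):
    # Thompson Sampling with uniform Beta(1, 1) priors on each arm.
    # Returns the cumulative expected regret after each round.
    rng = np.random.default_rng() if rng is None else rng
    k = len(true_means)
    successes = np.zeros(k)  # number of 1-rewards observed per arm
    failures = np.zeros(k)   # number of 0-rewards observed per arm
    best_mean = max(true_means)
    regret = np.zeros(horizon)
    cumulative = 0.0

    for t in range(horizon):
        # Draw one sample per arm from its Beta posterior and play
        # the arm whose sampled mean is largest.
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))
        reward = rng.binomial(1, true_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        cumulative += best_mean - true_means[arm]
        regret[t] = cumulative
    return regret

# Example: on a two-armed instance the cumulative regret should grow
# logarithmically in the horizon T, at the Lai-Robbins rate
# (mu* - mu) / kl(mu, mu*) * log T, where kl is the binary KL divergence.
print(thompson_sampling_bernoulli([0.45, 0.55], horizon=10_000)[-1])

The paper's result is that the regret of exactly this policy matches the Lai and Robbins rate in the example above, which was previously only known empirically.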

Cite

Text

Kaufmann et al. "Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis." International Conference on Algorithmic Learning Theory, 2012. doi:10.1007/978-3-642-34106-9_18

Markdown

[Kaufmann et al. "Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis." International Conference on Algorithmic Learning Theory, 2012.](https://mlanthology.org/alt/2012/kaufmann2012alt-thompson/) doi:10.1007/978-3-642-34106-9_18

BibTeX

@inproceedings{kaufmann2012alt-thompson,
  title     = {{Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis}},
  author    = {Kaufmann, Emilie and Korda, Nathaniel and Munos, Rémi},
  booktitle = {International Conference on Algorithmic Learning Theory},
  year      = {2012},
  pages     = {199--213},
  doi       = {10.1007/978-3-642-34106-9_18},
  url       = {https://mlanthology.org/alt/2012/kaufmann2012alt-thompson/}
}