Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Abstract
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.
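The Lai and Robbins lower bound referenced in the abstract states that, for Bernoulli arms with means μa and optimal mean μ*, any uniformly good policy must incur cumulative regret of at least (Σ over suboptimal arms a of (μ* − μa)/KL(μa, μ*)) · ln T asymptotically; the paper shows Thompson Sampling attains this rate. For reference, below is a minimal Python sketch of the Bernoulli Thompson Sampling algorithm the paper analyzes (Beta(1,1) priors, one posterior draw per arm per round). The function name, simulation setup, and example arm means are illustrative, not taken from the paper.

import random

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Simulate Thompson Sampling with Beta(1,1) priors on Bernoulli arms.

    Hypothetical helper for illustration; returns the cumulative
    (pseudo-)regret accumulated over the given horizon.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # posterior for arm a is Beta(successes[a]+1, failures[a]+1)
    failures = [0] * k
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior and play the argmax.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        # Observe a Bernoulli reward and update that arm's posterior.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        regret += best - true_means[arm]
    return regret

# Example: two arms with means 0.5 and 0.6; the paper's result implies
# the regret grows logarithmically in the horizon, at the Lai-Robbins rate.
print(thompson_sampling_bernoulli([0.5, 0.6], horizon=10000))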
Cite
Text
Kaufmann et al. "Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis." International Conference on Algorithmic Learning Theory, 2012. doi:10.1007/978-3-642-34106-9_18
Markdown
[Kaufmann et al. "Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis." International Conference on Algorithmic Learning Theory, 2012.](https://mlanthology.org/alt/2012/kaufmann2012alt-thompson/) doi:10.1007/978-3-642-34106-9_18
BibTeX
@inproceedings{kaufmann2012alt-thompson,
title = {{Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis}},
author = {Kaufmann, Emilie and Korda, Nathaniel and Munos, Rémi},
booktitle = {International Conference on Algorithmic Learning Theory},
year = {2012},
pages = {199--213},
doi = {10.1007/978-3-642-34106-9_18},
url = {https://mlanthology.org/alt/2012/kaufmann2012alt-thompson/}
}