Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions

Abstract

We focus on a classic reinforcement learning problem, the multi-armed bandit, and more specifically on the stochastic setting with reward distributions bounded in $[0,1]$. For this model, an optimal problem-dependent asymptotic regret lower bound has been derived. However, the existing algorithms achieving this lower bound all require solving an optimization problem at each step, which incurs a large computational cost. In this paper, we propose two new algorithms, which we prove to achieve the problem-dependent asymptotic regret lower bound. The first one, which we call Multinomial TS, is an adaptation of Thompson Sampling for Bernoulli rewards to multinomial reward distributions whose support is included in $\{0, \frac{1}{M}, \dots, 1\}$. This algorithm achieves the regret lower bound for multinomial distributions with the aforementioned support, and it can easily be generalized to reward distributions bounded in $[0, 1]$ by randomly rounding the observed rewards. The second algorithm, which we call Non-parametric TS, is a randomized algorithm that is not based on posterior sampling in the strict sense. At each step, it computes an average of the observed rewards with random weights. Not only is it asymptotically optimal, but it also performs very well even for small horizons.
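For a concrete picture of the mechanics the abstract describes, below is a minimal sketch (in Python/NumPy) of the per-arm index computations: the randomized rounding onto the grid $\{0, \frac{1}{M}, \dots, 1\}$ and the Dirichlet-posterior index for Multinomial TS, and the random weighted average of observed rewards for Non-parametric TS. The uniform Dirichlet prior/weights and the extra observation at the maximal reward 1 are illustrative assumptions based on the abstract, not a reproduction of the authors' reference implementation.

```python
# Sketch of the per-arm indices suggested by the abstract (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

def randomized_round(x, M):
    """Randomly round a reward x in [0, 1] onto the grid {0, 1/M, ..., 1},
    preserving its expectation: x*M is rounded up with probability equal to
    its fractional part, down otherwise. Returns the grid index k (reward k/M)."""
    scaled = x * M
    low = np.floor(scaled)
    return int(low + (rng.random() < (scaled - low)))

def multinomial_ts_index(counts, M):
    """Index of one arm for (a sketch of) Multinomial TS.

    `counts[k]` is the number of rounded rewards equal to k/M observed so far.
    A probability vector is sampled from the Dirichlet posterior over the M+1
    support points (uniform Dirichlet(1, ..., 1) prior assumed) and its mean
    reward is returned as the arm's index."""
    p = rng.dirichlet(np.asarray(counts, dtype=float) + 1.0)
    support = np.arange(M + 1) / M
    return float(p @ support)

def nonparametric_ts_index(rewards):
    """Index of one arm for (a sketch of) Non-parametric TS: a random weighted
    average of the observed rewards. An extra observation at the maximal
    reward 1 and Dirichlet(1, ..., 1) weights are assumed here."""
    values = np.append(np.asarray(rewards, dtype=float), 1.0)
    weights = rng.dirichlet(np.ones(len(values)))
    return float(weights @ values)
```

In either case, a round of the bandit algorithm would compute the index of every arm, pull the arm with the largest index, and update that arm's statistics (the grid counts, or the list of observed rewards) with the newly observed reward.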

Cite

Text

Riou and Honda. "Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions." Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.

Markdown

[Riou and Honda. "Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions." Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.](https://mlanthology.org/alt/2020/riou2020alt-bandit/)

BibTeX

@inproceedings{riou2020alt-bandit,
  title     = {{Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions}},
  author    = {Riou, Charles and Honda, Junya},
  booktitle = {Proceedings of the 31st International Conference on Algorithmic Learning Theory},
  year      = {2020},
  pages     = {777--826},
  volume    = {117},
  url       = {https://mlanthology.org/alt/2020/riou2020alt-bandit/}
}