Kullback-Leibler Maillard Sampling for Multi-Armed Bandits with Bounded Rewards

Abstract

We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. Maillard sampling (Maillard, 2013), an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting (Bian and Jun, 2022) while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we analyze the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling and a special case of Minimum Empirical Divergence (MED) (Honda and Takemura, 2011), for achieving a KL-style finite-time gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has an adaptive worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon; this is the first time such adaptivity has been reported in the literature for an algorithm with asymptotic optimality guarantees.
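For concreteness, below is a minimal Python sketch of the closed-form sampling rule the abstract refers to: at each round, KL-MS plays arm $a$ with probability $p_a \propto \exp(-N_a\,\mathrm{kl}(\hat\mu_a, \hat\mu_{\max}))$, where $\mathrm{kl}$ is the Bernoulli KL divergence, $N_a$ and $\hat\mu_a$ are arm $a$'s pull count and empirical mean, and $\hat\mu_{\max}$ is the largest empirical mean. This is our own illustrative implementation under the bounded-reward setting of the paper; all names (`binary_kl`, `kl_ms_probs`, `kl_ms`, `pull`) are hypothetical, not from the authors' code.

```python
import numpy as np

def binary_kl(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p, q) for means p, q in [0, 1] (vectorized)."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ms_probs(means, counts):
    """Closed-form KL-MS action probabilities: p_a ∝ exp(-N_a * kl(mu_a, mu_max))."""
    mu_max = means.max()
    logits = -counts * binary_kl(means, mu_max)
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

def kl_ms(pull, K, T, rng=np.random.default_rng(0)):
    """Run KL-MS for T rounds; `pull(a)` (hypothetical) returns a reward in [0, 1]."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    for a in range(K):  # initialization: pull each arm once
        sums[a] += pull(a)
        counts[a] += 1
    for _ in range(K, T):
        probs = kl_ms_probs(sums / counts, counts)
        a = rng.choice(K, p=probs)
        sums[a] += pull(a)
        counts[a] += 1
    return counts, sums / counts
```

Note that the empirically best arm always has $\mathrm{kl}(\hat\mu_{\max}, \hat\mu_{\max}) = 0$ and hence receives the largest (unnormalized) weight, while undersampled or near-optimal arms retain non-negligible probability; the probabilities are available in closed form at every round, which is what makes the policy convenient for offline evaluation.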

Cite

Text

Qin et al. "Kullback-Leibler Maillard Sampling for Multi-Armed Bandits with Bounded Rewards." Neural Information Processing Systems, 2023.

Markdown

[Qin et al. "Kullback-Leibler Maillard Sampling for Multi-Armed Bandits with Bounded Rewards." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/qin2023neurips-kullbackleibler/)

BibTeX

@inproceedings{qin2023neurips-kullbackleibler,
  title     = {{Kullback-Leibler Maillard Sampling for Multi-Armed Bandits with Bounded Rewards}},
  author    = {Qin, Hao and Jun, Kwang-Sung and Zhang, Chicheng},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/qin2023neurips-kullbackleibler/}
}