Thompson Sampling for Combinatorial Semi-Bandits

Abstract

We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB and obtain the first distribution-dependent regret bound of $O(m\log T / \Delta_{\min})$ for TS under general CMAB, where $m$ is the number of arms, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that an approximate oracle cannot be used with TS, even for MAB problems. We then extend the analysis to the matroid bandit, a special case of CMAB, for which we remove the independence assumption across arms and achieve a better regret bound. Finally, we compare the empirical regrets of the CUCB and CTS algorithms in experiments.
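The combinatorial TS (CTS) scheme described above maintains a Beta posterior per arm, samples a mean vector from the posteriors, and feeds the sample to an exact offline oracle that returns the best super-arm; semi-bandit feedback then updates every played arm. A minimal sketch on a hypothetical toy instance (a top-$k$ problem, where the exact oracle is simply "take the $k$ arms with the largest sampled means"; the function name and parameters are illustrative, not from the paper):

```python
import random

def cts_top_k(mus, k, horizon, seed=0):
    """Sketch of combinatorial Thompson sampling on a toy top-k semi-bandit.

    Assumed toy instance: a super-arm is any subset of k arms, the reward is
    the sum of independent Bernoulli(mu_i) outcomes of the chosen arms, and
    the outcome of every chosen arm is observed (semi-bandit feedback).
    Returns the number of times each arm was played.
    """
    rng = random.Random(seed)
    m = len(mus)
    a = [1] * m  # Beta posterior: success counts (uniform prior)
    b = [1] * m  # Beta posterior: failure counts
    for _ in range(horizon):
        # Sample a mean vector from the per-arm Beta posteriors.
        theta = [rng.betavariate(a[i], b[i]) for i in range(m)]
        # Exact offline oracle for the top-k problem: pick the k largest samples.
        chosen = sorted(range(m), key=lambda i: theta[i], reverse=True)[:k]
        # Semi-bandit feedback: observe and update every played arm.
        for i in chosen:
            if rng.random() < mus[i]:
                a[i] += 1
            else:
                b[i] += 1
    return [a[i] + b[i] - 2 for i in range(m)]
```

Over a long horizon the suboptimal arms are played only $O(\log T)$ times, so their play counts fall far below those of the arms in the optimal super-arm.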

Cite

Text

Wang and Chen. "Thompson Sampling for Combinatorial Semi-Bandits." International Conference on Machine Learning, 2018.

Markdown

[Wang and Chen. "Thompson Sampling for Combinatorial Semi-Bandits." International Conference on Machine Learning, 2018.](https://mlanthology.org/icml/2018/wang2018icml-thompson/)

BibTeX

@inproceedings{wang2018icml-thompson,
  title     = {{Thompson Sampling for Combinatorial Semi-Bandits}},
  author    = {Wang, Siwei and Chen, Wei},
  booktitle = {International Conference on Machine Learning},
  year      = {2018},
  pages     = {5114--5122},
  volume    = {80},
  url       = {https://mlanthology.org/icml/2018/wang2018icml-thompson/}
}