Thompson Sampling for Combinatorial Semi-Bandits
Abstract
We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for general CMAB and obtain the first distribution-dependent regret bound of $O(m\log T / \Delta_{\min})$ for TS in this setting, where $m$ is the number of arms, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that TS cannot use an approximation oracle, even for classical MAB problems. We then extend the analysis to matroid bandits, a special case of CMAB for which we remove the independence assumption across arms and achieve a tighter regret bound. Finally, we present experiments comparing the regret of the CUCB and CTS algorithms.
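To make the setting concrete, the combinatorial TS (CTS) loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact pseudocode: it assumes Bernoulli base arms with Beta posteriors, semi-bandit feedback (outcomes observed for every played arm), and an exact offline oracle supplied by the caller; note that the abstract's negative result says this oracle must be exact, not an approximation oracle.

```python
import random

def cts_round(successes, failures, oracle):
    """One round of combinatorial Thompson sampling (illustrative sketch).

    successes, failures: per-arm counts defining Beta posteriors.
    oracle: an EXACT offline oracle mapping a vector of mean estimates
            to a super-arm (list of arm indices) with maximum expected
            reward under those estimates.
    """
    # Sample a mean estimate for each base arm from its Beta posterior.
    theta = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    # Play the super-arm the exact oracle selects under the samples.
    return oracle(theta)

def cts_update(successes, failures, played, outcomes):
    # Semi-bandit feedback: each played arm's Bernoulli outcome is observed.
    for i, x in zip(played, outcomes):
        successes[i] += x
        failures[i] += 1 - x
```

As a toy instance, taking the oracle to pick the single arm with the largest sampled mean recovers classical TS for the Bernoulli MAB, and the loop concentrates play on the arm with the highest true mean.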
Cite
Text
Wang and Chen. "Thompson Sampling for Combinatorial Semi-Bandits." International Conference on Machine Learning, 2018.
Markdown
[Wang and Chen. "Thompson Sampling for Combinatorial Semi-Bandits." International Conference on Machine Learning, 2018.](https://mlanthology.org/icml/2018/wang2018icml-thompson/)
BibTeX
@inproceedings{wang2018icml-thompson,
title = {{Thompson Sampling for Combinatorial Semi-Bandits}},
author = {Wang, Siwei and Chen, Wei},
booktitle = {International Conference on Machine Learning},
year = {2018},
pages = {5114--5122},
volume = {80},
url = {https://mlanthology.org/icml/2018/wang2018icml-thompson/}
}