Thompson Sampling for Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox

Abstract

We consider Thompson Sampling (TS) for linear combinatorial semi-bandits with subgaussian rewards. We propose the first TS algorithm whose finite-time regret does not scale exponentially with the dimension of the problem. We further show the mismatched sampling paradox: a learner who knows the reward distributions and samples from the correct posterior can perform exponentially worse than a learner who does not know the rewards and simply samples from a well-chosen Gaussian posterior. The code used to generate the experiments is available at https://github.com/RaymZhang/CTS-Mismatched-Paradox
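To make the setting concrete, here is a minimal sketch of Thompson Sampling with a Gaussian posterior for a linear combinatorial semi-bandit. This is an illustrative toy implementation, not the paper's exact algorithm: the instance, the prior/posterior widths, and all function names are assumptions made for the example. Each base arm keeps an empirical mean; the learner samples each arm's mean from a Gaussian posterior, plays the super-arm maximizing the sampled linear reward, and observes semi-bandit feedback (one noisy reward per chosen base arm).

```python
import numpy as np

def gaussian_ts_semi_bandit(means, superarms, T, sigma=0.5, seed=0):
    """Illustrative Gaussian-posterior TS for a linear combinatorial
    semi-bandit. `means` holds the true base-arm means (used only to
    simulate feedback and measure regret); `superarms` is a list of
    index arrays, each a feasible subset of base arms."""
    rng = np.random.default_rng(seed)
    d = len(means)
    counts = np.zeros(d)   # number of observations per base arm
    sums = np.zeros(d)     # running sum of observed rewards per base arm
    best = max(np.sum(means[a]) for a in superarms)
    regret = 0.0
    for _ in range(T):
        # Gaussian posterior per base arm: mean = empirical mean,
        # standard deviation shrinking like sigma / sqrt(#pulls).
        safe_counts = np.maximum(counts, 1)
        mu_hat = np.where(counts > 0, sums / safe_counts, 0.0)
        theta = rng.normal(mu_hat, sigma / np.sqrt(safe_counts))
        theta[counts == 0] = np.inf  # force each base arm to be tried once
        # Play the super-arm maximizing the sampled linear reward.
        chosen = max(superarms, key=lambda a: np.sum(theta[a]))
        # Semi-bandit feedback: one noisy observation per chosen base arm.
        obs = means[chosen] + rng.normal(0.0, sigma, size=len(chosen))
        counts[chosen] += 1
        sums[chosen] += obs
        regret += best - np.sum(means[chosen])
    return regret

# Toy instance (assumed for illustration): pick 2 of 4 base arms.
means = np.array([0.9, 0.8, 0.3, 0.2])
superarms = [np.array(a) for a in
             ([0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3])]
total_regret = gaussian_ts_semi_bandit(means, superarms, T=500)
```

The sampled vector `theta` is the only place randomness enters the decision; everything downstream is a deterministic linear maximization over the feasible super-arms, which is what makes Gaussian posterior sampling attractive even when it is mismatched with the true reward distributions.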

Cite

Text

Zhang and Combes. "Thompson Sampling for Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox." Neural Information Processing Systems, 2024. doi:10.52202/079017-2839

Markdown

[Zhang and Combes. "Thompson Sampling for Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhang2024neurips-thompson/) doi:10.52202/079017-2839

BibTeX

@inproceedings{zhang2024neurips-thompson,
  title     = {{Thompson Sampling for Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox}},
  author    = {Zhang, Raymond and Combes, Richard},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2839},
  url       = {https://mlanthology.org/neurips/2024/zhang2024neurips-thompson/}
}