Thompson Sampling for Bandits with Cool-Down Periods
Abstract
This paper investigates a variation of dynamic bandits, characterized by arms that follow a periodic availability pattern. Upon a "successful" selection, each arm transitions to an inactive state and requires a possibly unknown cool-down period before becoming active again. We devise Thompson Sampling algorithms specifically designed for this problem, guaranteeing logarithmic regret. Notably, this work is the first to address scenarios in which the agent lacks knowledge of each arm's active state. Furthermore, the theoretical findings extend to the sleeping bandit framework, offering a notably superior regret bound compared to existing literature.
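To make the setting concrete, the following is a minimal simulation sketch, not the paper's algorithm: Beta-Bernoulli Thompson Sampling restricted to the currently active arms, where a successful pull deactivates an arm for a cool-down period. The function name, the assumption of known cool-down lengths, and the Bernoulli reward model are illustrative choices, not details taken from the paper.

```python
import random

def ts_cooldown(means, cooldowns, horizon, seed=0):
    """Illustrative Beta-Bernoulli Thompson Sampling where a
    successful pull makes an arm inactive for a cool-down period.
    `means` are true Bernoulli parameters (for simulation only);
    `cooldowns` gives each arm's cool-down length, assumed known here.
    Returns the total reward collected over `horizon` rounds."""
    rng = random.Random(seed)
    k = len(means)
    alpha = [1] * k          # Beta posterior: successes + 1
    beta = [1] * k           # Beta posterior: failures + 1
    ready_at = [0] * k       # first round each arm is active again
    total = 0
    for t in range(horizon):
        active = [i for i in range(k) if ready_at[i] <= t]
        if not active:       # every arm is cooling down: skip round
            continue
        # sample a mean estimate from each active arm's posterior
        i = max(active, key=lambda a: rng.betavariate(alpha[a], beta[a]))
        r = 1 if rng.random() < means[i] else 0
        total += r
        if r:                # a success triggers the cool-down
            alpha[i] += 1
            ready_at[i] = t + 1 + cooldowns[i]
        else:
            beta[i] += 1
    return total
```

The paper's harder setting, where the agent does not observe which arms are active and the cool-down lengths may be unknown, requires additional machinery beyond this sketch.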
Cite
Text
Zhu and Liu. "Thompson Sampling for Bandits with Cool-Down Periods." Transactions on Machine Learning Research, 2025.
Markdown
[Zhu and Liu. "Thompson Sampling for Bandits with Cool-Down Periods." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/zhu2025tmlr-thompson/)
BibTeX
@article{zhu2025tmlr-thompson,
title = {{Thompson Sampling for Bandits with Cool-Down Periods}},
author = {Zhu, Jingxuan and Liu, Bin},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/zhu2025tmlr-thompson/}
}