On Upper-Confidence Bound Policies for Switching Bandit Problems

Abstract

Many problems, such as cognitive radio, parameter control of a scanning tunnelling microscope or internet advertisement, can be modelled as non-stationary bandit problems where the distributions of rewards change abruptly at unknown time instants. In this paper, we analyze two algorithms designed for solving this issue: discounted UCB (D-UCB) and sliding-window UCB (SW-UCB). We establish an upper bound on the expected regret by upper-bounding the expected number of times suboptimal arms are played. The proof relies on an interesting Hoeffding-type inequality for self-normalized deviations with a random number of summands. We establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions. We show that discounted UCB and sliding-window UCB both match the lower bound up to a logarithmic factor. Numerical simulations show that D-UCB and SW-UCB perform significantly better than existing soft-max methods such as EXP3.S.
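To make the sliding-window idea concrete, here is a minimal Python sketch of an SW-UCB-style policy: empirical means and play counts are computed over only the last `window` rounds, and the exploration padding shrinks with the in-window count of each arm. The function name `sw_ucb`, the reward interface `rewards_fn(t, arm)`, and the default values of `window` and the exploration constant `xi` are illustrative assumptions, not the paper's exact tuning.

```python
import math
from collections import deque

def sw_ucb(rewards_fn, n_arms, horizon, window=200, xi=0.6):
    """Sliding-window UCB sketch (illustrative, not the paper's exact algorithm).

    All statistics are computed over the last `window` plays only, so the
    policy forgets rewards observed before an abrupt change.
    `rewards_fn(t, arm)` must return a reward in [0, 1].
    """
    history = deque()           # (arm, reward) pairs currently inside the window
    counts = [0] * n_arms       # in-window play counts
    sums = [0.0] * n_arms       # in-window reward sums
    choices = []
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1         # initialisation: play each arm once
        else:
            def index(i):
                if counts[i] == 0:
                    return float("inf")   # force exploration of forgotten arms
                mean = sums[i] / counts[i]
                pad = math.sqrt(xi * math.log(min(t, window)) / counts[i])
                return mean + pad
            arm = max(range(n_arms), key=index)
        r = rewards_fn(t, arm)
        history.append((arm, r))
        counts[arm] += 1
        sums[arm] += r
        if len(history) > window:         # drop the play that leaves the window
            old_arm, old_r = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_r
        choices.append(arm)
    return choices
```

On a two-armed problem whose best arm switches at an unknown time, the window lets the policy discard stale evidence: within roughly one window length after the switch, the previously suboptimal arm's index overtakes the stale leader's. D-UCB achieves a similar forgetting effect by geometrically discounting past rewards instead of truncating them.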

Cite

Text

Garivier and Moulines. "On Upper-Confidence Bound Policies for Switching Bandit Problems." International Conference on Algorithmic Learning Theory, 2011. doi:10.1007/978-3-642-24412-4_16

Markdown

[Garivier and Moulines. "On Upper-Confidence Bound Policies for Switching Bandit Problems." International Conference on Algorithmic Learning Theory, 2011.](https://mlanthology.org/alt/2011/garivier2011alt-upperconfidence/) doi:10.1007/978-3-642-24412-4_16

BibTeX

@inproceedings{garivier2011alt-upperconfidence,
  title     = {{On Upper-Confidence Bound Policies for Switching Bandit Problems}},
  author    = {Garivier, Aurélien and Moulines, Eric},
  booktitle = {International Conference on Algorithmic Learning Theory},
  year      = {2011},
  pages     = {174--188},
  doi       = {10.1007/978-3-642-24412-4_16},
  url       = {https://mlanthology.org/alt/2011/garivier2011alt-upperconfidence/}
}