Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit

Abstract

The multi-armed bandit (MAB) problem is a class of online learning problems in which a learning agent aims to maximize its expected cumulative reward while repeatedly selecting arms to pull from a set with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component into classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm achieves a nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with state-of-the-art algorithms in numerical experiments on a public Yahoo! dataset and a real-world digital marketing dataset to demonstrate its superior performance.
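
The abstract only sketches the mechanism, so here is a minimal Python sketch of an M-UCB-style procedure under stated assumptions: UCB1 is run on the data collected since the last restart, a small fraction of rounds is spent on forced uniform exploration, and a change is declared for an arm when the means of the two halves of its most recent window of rewards differ by more than a threshold. The callback `pull_arm`, the window length `w`, the threshold `b`, and the exploration fraction `gamma` are illustrative placeholders, not the tuned choices from the paper.

```python
import numpy as np

def change_detected(window, threshold):
    # Declare a change if the sums of the two halves of the window
    # differ by more than the threshold (sum scale, two-sided test).
    half = len(window) // 2
    return abs(sum(window[half:]) - sum(window[:half])) > threshold

def m_ucb(pull_arm, K, T, w=100, b=10.0, gamma=0.05):
    """Hedged sketch of an M-UCB-style procedure.

    pull_arm(k, t) -> reward in [0, 1] is a hypothetical environment callback.
    w     : change-detection window length (even)
    b     : change-detection threshold
    gamma : rough fraction of rounds devoted to forced exploration
    """
    tau = 0                                  # time of the last restart
    counts = np.zeros(K, dtype=int)          # pulls per arm since restart
    sums = np.zeros(K)                       # reward sums since restart
    windows = [[] for _ in range(K)]         # last w rewards per arm

    for t in range(1, T + 1):
        n = t - tau
        period = int(np.ceil(K / gamma))
        if (n - 1) % period < K:
            # Forced exploration: cycle through all arms so every arm's
            # detection window keeps receiving fresh samples.
            arm = (n - 1) % K
        else:
            # UCB1 index computed only on data gathered since the last restart.
            safe = np.maximum(counts, 1)
            ucb = sums / safe + np.sqrt(2 * np.log(n) / safe)
            ucb[counts == 0] = np.inf
            arm = int(np.argmax(ucb))

        r = pull_arm(arm, t)
        counts[arm] += 1
        sums[arm] += r
        windows[arm].append(r)
        if len(windows[arm]) > w:
            windows[arm].pop(0)

        # Run the detector once the arm's window is full; on alarm,
        # discard all statistics and restart the UCB phase.
        if len(windows[arm]) == w and change_detected(windows[arm], b):
            tau = t
            counts[:] = 0
            sums[:] = 0
            windows = [[] for _ in range(K)]
```

In this sketch the restart forgets all past data, which is what lets the regret scale with the number of stationary segments $M$ rather than with the severity of the distribution shifts.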

Cite

Text

Cao et al. "Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit." Artificial Intelligence and Statistics, 2019.

Markdown

[Cao et al. "Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit." Artificial Intelligence and Statistics, 2019.](https://mlanthology.org/aistats/2019/cao2019aistats-nearly/)

BibTeX

@inproceedings{cao2019aistats-nearly,
  title     = {{Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit}},
  author    = {Cao, Yang and Wen, Zheng and Kveton, Branislav and Xie, Yao},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2019},
  pages     = {418--427},
  volume    = {89},
  url       = {https://mlanthology.org/aistats/2019/cao2019aistats-nearly/}
}