Smooth Non-Stationary Bandits

Abstract

In many applications of online decision making, the environment is non-stationary, and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, and they guarantee $T^{2/3}$ regret in that regime. However, in practice, environments often change smoothly, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. In this paper, we study a non-stationary two-armed bandit problem where we assume an arm’s mean reward is a $\beta$-Hölder function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with $T^{3/5}$ regret for $\beta=2$. We complement this result with a $T^{\frac{\beta+1}{2\beta+1}}$ lower bound for any integer $\beta\ge 1$, which matches our upper bound for $\beta=2$.
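To make the quantities in the abstract concrete, the following LaTeX sketch spells out one standard formalization of the integer-$\beta$ Hölder condition (the Hölder constant $L$ and the normalization of time to $[0,1]$ are assumptions made here for illustration, not taken from the paper) and evaluates the lower-bound exponent, recovering the $T^{2/3}$ and $T^{3/5}$ rates quoted above.

% Sketch only: the constant L and the time interval [0,1] are assumed here.
\[
  \bigl|\mu^{(\beta-1)}(t) - \mu^{(\beta-1)}(s)\bigr| \;\le\; L\,|t-s|,
  \qquad t, s \in [0,1],
\]
% i.e., the mean reward $\mu$ is $(\beta-1)$-times differentiable with a
% Lipschitz $(\beta-1)$-th derivative. The lower-bound exponent
% $(\beta+1)/(2\beta+1)$ gives the non-smooth rate at $\beta=1$ and matches
% the paper's upper bound at $\beta=2$:
\[
  \left.\frac{\beta+1}{2\beta+1}\right|_{\beta=1} = \frac{2}{3},
  \qquad
  \left.\frac{\beta+1}{2\beta+1}\right|_{\beta=2} = \frac{3}{5},
  \qquad
  \lim_{\beta\to\infty}\frac{\beta+1}{2\beta+1} = \frac{1}{2}.
\]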

Cite

Text

Jia et al. "Smooth Non-Stationary Bandits." International Conference on Machine Learning, 2023.

Markdown

[Jia et al. "Smooth Non-Stationary Bandits." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/jia2023icml-smooth/)

BibTeX

@inproceedings{jia2023icml-smooth,
  title     = {{Smooth Non-Stationary Bandits}},
  author    = {Jia, Su and Xie, Qian and Kallus, Nathan and Frazier, Peter I.},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {14930--14944},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/jia2023icml-smooth/}
}