Weighted Linear Bandits for Non-Stationary Environments

Abstract

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments. We provide theoretical guarantees on the behavior of D-LinUCB in both slowly-varying and abruptly-changing environments. We obtain an upper bound on the dynamic regret that is of order d^{2/3} B_T^{1/3} T^{2/3}, where B_T is a measure of non-stationarity (d and T being, respectively, dimension and horizon). This rate is known to be optimal. We also illustrate the empirical performance of D-LinUCB and compare it with recently proposed alternatives in simulated environments.
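The abstract's key mechanism is discounted linear regression: past observations are down-weighted by exponential factors gamma^{t-s} so the estimator tracks a drifting parameter. Below is a minimal Python sketch of this idea, not the paper's exact algorithm: the class name and parameter defaults are illustrative assumptions, and the exploration width sqrt(a^T V^{-1} a) is a simplification of the paper's confidence bound, which involves a second design matrix built with squared weights gamma^{2(t-s)}.

```python
# Minimal sketch of discounted linear regression for a D-LinUCB-style bandit.
# Illustrative only: gamma, lam, and beta are assumed values, and the
# confidence width below is simpler than the one analyzed in the paper.
import numpy as np

class DiscountedLinUCBSketch:
    def __init__(self, d, gamma=0.99, lam=1.0, beta=1.0):
        self.gamma = gamma            # discount factor in (0, 1): exponential forgetting
        self.lam = lam                # ridge regularization strength
        self.beta = beta              # exploration coefficient
        self.V = lam * np.eye(d)      # lam*I + sum_s gamma^{t-s} a_s a_s^T
        self.b = np.zeros(d)          # sum_s gamma^{t-s} r_s a_s

    def select(self, actions):
        """Pick the context vector with the largest optimistic reward estimate."""
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b    # weighted least-squares estimate
        ucb = [a @ theta_hat + self.beta * np.sqrt(a @ V_inv @ a) for a in actions]
        return int(np.argmax(ucb))

    def update(self, a, r):
        """Discount past statistics, then add the new observation (a, r)."""
        d = len(a)
        # The (1 - gamma) * lam * I term keeps the regularizer fixed at lam * I.
        self.V = self.gamma * self.V + np.outer(a, a) + (1 - self.gamma) * self.lam * np.eye(d)
        self.b = self.gamma * self.b + r * a
```

Note on the update: adding (1 - gamma) * lam * I after each discount step means V always equals lam * I plus the exponentially weighted sum of outer products, rather than letting the regularization decay; with gamma = 1 the sketch reduces to standard LinUCB on stationary data.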

Cite

Text

Russac et al. "Weighted Linear Bandits for Non-Stationary Environments." Neural Information Processing Systems, 2019.

Markdown

[Russac et al. "Weighted Linear Bandits for Non-Stationary Environments." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/russac2019neurips-weighted/)

BibTeX

@inproceedings{russac2019neurips-weighted,
  title     = {{Weighted Linear Bandits for Non-Stationary Environments}},
  author    = {Russac, Yoan and Vernade, Claire and Cappé, Olivier},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {12040--12049},
  url       = {https://mlanthology.org/neurips/2019/russac2019neurips-weighted/}
}