Randomized Exploration for Non-Stationary Stochastic Linear Bandits

Abstract

We investigate two perturbation approaches to overcome the conservatism that optimism-based algorithms chronically suffer from in practice. The first approach replaces optimism with simple randomization when using confidence sets. The second adds random perturbations to the current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), via these two perturbation approaches. We highlight the trade-off between statistical optimality and computational efficiency: the former asymptotically achieves the optimal dynamic regret $\tilde{O}(d^{2/3} B_T^{1/3} T^{2/3})$, while the latter is oracle-efficient at the cost of an extra logarithmic factor in the number of arms relative to the minimax-optimal dynamic regret. In a simulation study, both algorithms show outstanding performance in tackling the conservatism issue that Discounted LinUCB struggles with.
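To make the second perturbation approach concrete, below is a minimal sketch of one round of a Discounted Linear Thompson Sampling style rule: a discounted ridge estimate is perturbed by Gaussian noise before the greedy arm choice. The function name, the covariance choice `scale**2 * V^{-1}`, and the per-round recomputation of the discounted statistics are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def d_lints_step(history, arms, gamma=0.99, lam=1.0, scale=1.0, rng=None):
    """One round of a D-LinTS-style rule (illustrative sketch).

    history: list of (feature_vector, reward) pairs from past rounds.
    arms:    array of shape (K, d), current feature vector of each arm.
    gamma:   discount factor down-weighting old observations
             (assumed mechanism for handling non-stationarity).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = arms.shape[1]
    V = lam * np.eye(d)              # discounted, regularized Gram matrix
    b = np.zeros(d)                  # discounted reward-weighted feature sum
    t = len(history)
    for s, (x, r) in enumerate(history):
        w = gamma ** (t - 1 - s)     # older rounds receive smaller weight
        V += w * np.outer(x, x)
        b += w * r * x
    theta_hat = np.linalg.solve(V, b)        # discounted ridge estimate
    # Perturb the estimate, then maximize the (perturbed) expected reward.
    # Assumed perturbation covariance: scale^2 * V^{-1}; the paper's exact
    # scaling and Gram-matrix weighting may differ.
    cov = scale**2 * np.linalg.inv(V)
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    return int(np.argmax(arms @ theta_tilde))
```

Because the exploration comes entirely from the Gaussian perturbation, each round reduces to a single linear-optimization (argmax) call over the arm set, which is what makes this approach oracle-efficient.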

Cite

Text

Kim and Tewari. "Randomized Exploration for Non-Stationary Stochastic Linear Bandits." Uncertainty in Artificial Intelligence, 2020.

Markdown

[Kim and Tewari. "Randomized Exploration for Non-Stationary Stochastic Linear Bandits." Uncertainty in Artificial Intelligence, 2020.](https://mlanthology.org/uai/2020/kim2020uai-randomized/)

BibTeX

@inproceedings{kim2020uai-randomized,
  title     = {{Randomized Exploration for Non-Stationary Stochastic Linear Bandits}},
  author    = {Kim, Baekjin and Tewari, Ambuj},
  booktitle = {Uncertainty in Artificial Intelligence},
  year      = {2020},
  pages     = {71--80},
  volume    = {124},
  url       = {https://mlanthology.org/uai/2020/kim2020uai-randomized/}
}