Regret Minimization in Heavy-Tailed Bandits

Abstract

We revisit the classic regret-minimization problem in the stochastic multi-armed bandit setting when the arm distributions are allowed to be heavy-tailed. Regret minimization has been well studied in the simpler settings of either bounded-support reward distributions or distributions that belong to a single-parameter exponential family. We work under the much weaker assumption that the moments of order \((1+\epsilon)\) are uniformly bounded by a known constant \(B\), for some given \(\epsilon > 0\). We propose an optimal algorithm that matches the lower bound exactly in the first-order term, and we give a finite-time bound on its regret. We show that our index concentrates faster than the well-known truncated or trimmed empirical mean estimators for the mean of heavy-tailed distributions. Computing our index can be computationally demanding; to address this, we develop a batch-based algorithm that is optimal up to a multiplicative constant depending on the batch size. This yields a controlled trade-off between statistical optimality and computational cost.
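For context, the first-order lower bound referred to above is of the Burnetas-Katehakis form \(\liminf_{T \to \infty} R_T / \log T \ge \sum_{a : \mu_a < \mu^*} (\mu^* - \mu_a) / \mathcal{K}_{\inf}(\nu_a, \mu^*)\), where \(\mathcal{K}_{\inf}(\nu_a, \mu^*)\) denotes the smallest KL divergence from arm \(a\)'s distribution \(\nu_a\) to any distribution in the moment-constrained class with mean exceeding \(\mu^*\). As a point of comparison for the index discussed in the abstract, below is a minimal sketch of the truncated empirical mean baseline, using the threshold schedule of Bubeck, Cesa-Bianchi, and Lugosi (2013); the confidence parameter delta, the choice of B, and the Pareto test distribution are assumptions of this illustration, not details taken from the paper.

import numpy as np

def truncated_mean(samples, B, eps, delta):
    """Truncated empirical mean for heavy-tailed data.

    Assumes E|X|^(1+eps) <= B and confidence level delta, following
    the threshold schedule of Bubeck, Cesa-Bianchi & Lugosi (2013).
    This is the baseline estimator the paper's index is compared
    against, not the paper's own index.
    """
    x = np.asarray(samples, dtype=float)
    n = len(x)
    # Sample-dependent truncation: the s-th sample is kept only if
    # |X_s| <= (B * s / log(1/delta))^(1/(1+eps)); otherwise it is
    # replaced by zero before averaging.
    s = np.arange(1, n + 1)
    thresholds = (B * s / np.log(1.0 / delta)) ** (1.0 / (1.0 + eps))
    kept = np.where(np.abs(x) <= thresholds, x, 0.0)
    return kept.sum() / n

# Example: Lomax (Pareto II) rewards with tail index 1.5, so the
# (1+eps)-th moment is finite for eps < 0.5 while the variance is infinite.
rng = np.random.default_rng(0)
samples = rng.pareto(1.5, size=10_000)
# B = 15.0 is a rough (illustrative) bound on E|X|^(1.4) for this distribution.
print(truncated_mean(samples, B=15.0, eps=0.4, delta=0.01))  # true mean is 2

The zero-replacement of large samples is what tames the heavy tail: it biases the estimate downward slightly, but the growing per-sample thresholds let that bias vanish at a rate controlled by \(B\) and \(\epsilon\).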

Cite

Text

Agrawal et al. "Regret Minimization in Heavy-Tailed Bandits." Conference on Learning Theory, 2021.

Markdown

[Agrawal et al. "Regret Minimization in Heavy-Tailed Bandits." Conference on Learning Theory, 2021.](https://mlanthology.org/colt/2021/agrawal2021colt-regret/)

BibTeX

@inproceedings{agrawal2021colt-regret,
  title     = {{Regret Minimization in Heavy-Tailed Bandits}},
  author    = {Agrawal, Shubhada and Juneja, Sandeep K. and Koolen, Wouter M.},
  booktitle = {Conference on Learning Theory},
  year      = {2021},
  pages     = {26--62},
  volume    = {134},
  url       = {https://mlanthology.org/colt/2021/agrawal2021colt-regret/}
}