A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret

Abstract

We pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of $O(\sqrt{Q \log T})$, where $Q$ is the total quadratic variation of all the arms.

Cite

Text

Hazan and Kale. "A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret." Proceedings of the 24th Annual Conference on Learning Theory, 2011.

Markdown

[Hazan and Kale. "A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret." Proceedings of the 24th Annual Conference on Learning Theory, 2011.](https://mlanthology.org/colt/2011/hazan2011colt-simple/)

BibTeX

@inproceedings{hazan2011colt-simple,
  title     = {{A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret}},
  author    = {Hazan, Elad and Kale, Satyen},
  booktitle = {Proceedings of the 24th Annual Conference on Learning Theory},
  year      = {2011},
  pages     = {817-820},
  volume    = {19},
  url       = {https://mlanthology.org/colt/2011/hazan2011colt-simple/}
}