A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret
Abstract
We pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of $O(\sqrt{Q \log T})$, where $Q$ is the total quadratic variation of all the arms.
Cite
Text
Hazan and Kale. "A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret." Proceedings of the 24th Annual Conference on Learning Theory, 2011.Markdown
[Hazan and Kale. "A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret." Proceedings of the 24th Annual Conference on Learning Theory, 2011.](https://mlanthology.org/colt/2011/hazan2011colt-simple/)BibTeX
@inproceedings{hazan2011colt-simple,
title = {{A Simple Multi-Armed Bandit Algorithm with Optimal Variation-Bounded Regret}},
author = {Hazan, Elad and Kale, Satyen},
booktitle = {Proceedings of the 24th Annual Conference on Learning Theory},
year = {2011},
pages = {817-820},
volume = {19},
url = {https://mlanthology.org/colt/2011/hazan2011colt-simple/}
}