The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information

Abstract

We present Epoch-Greedy, an algorithm for multi-armed bandits with observable side information. Epoch-Greedy has the following properties: No knowledge of a time horizon $T$ is necessary. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. The regret scales as $O(T^{2/3} S^{1/3})$ or better (sometimes, much better). Here $S$ is the complexity term in a sample complexity bound for standard supervised learning.

Cite

Text

Langford and Zhang. "The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information." Neural Information Processing Systems, 2007.

Markdown

[Langford and Zhang. "The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information." Neural Information Processing Systems, 2007.](https://mlanthology.org/neurips/2007/langford2007neurips-epochgreedy/)

BibTeX

@inproceedings{langford2007neurips-epochgreedy,
  title     = {{The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information}},
  author    = {Langford, John and Zhang, Tong},
  booktitle = {Neural Information Processing Systems},
  year      = {2007},
  pages     = {817-824},
  url       = {https://mlanthology.org/neurips/2007/langford2007neurips-epochgreedy/}
}