Infinitely Many-Armed Bandits with Unknown Value Distribution

Abstract

We consider a version of the classical stochastic multi-armed bandit problem in which the number of arms is large compared to the time horizon, with the goal of minimizing the cumulative regret. Here, the mean reward (or value) of newly chosen arms is assumed to be i.i.d. We further make the simplifying assumption that the value of an arm is revealed once this arm is chosen. We present a general lower bound on the regret, and learning algorithms that achieve this bound up to a logarithmic factor. Contrary to previous work, we do not assume that the functional form of the tail of the value distribution is known. Furthermore, we also consider a variant of our model where sampled arms are non-retainable, namely, they are lost if not used continuously, with similar near-optimality results.
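To make the setting concrete, the sketch below simulates a naive threshold policy for this model: since an arm's value is revealed on its first pull, a learner can keep sampling fresh arms until one's value exceeds a threshold and then commit to it. This is an illustrative toy under assumed uniform values, not the paper's algorithm or its lower-bound-matching policy; the function and parameter names are hypothetical.

```python
import random

def simulate_threshold_policy(horizon, threshold, value_dist, seed=0):
    """Toy simulation of an infinitely many-armed bandit in which an arm's
    mean reward (value) is revealed on its first pull.

    Illustrative threshold policy (NOT the paper's algorithm): sample a
    fresh arm each round until one's value exceeds `threshold`, then
    commit to that arm for the remaining rounds.
    """
    rng = random.Random(seed)
    total = 0.0
    committed = None  # value of the arm we have committed to, if any
    for _ in range(horizon):
        if committed is None:
            v = value_dist(rng)  # new arm; its value is revealed at once
            total += v
            if v >= threshold:
                committed = v    # retain this arm from now on
        else:
            total += committed
    # Regret relative to an ideal mean reward of 1.0 per step
    # (appropriate when values lie in [0, 1]).
    return horizon * 1.0 - total

# Example: arm values drawn uniformly on [0, 1]. A higher threshold
# explores longer before committing; the threshold/horizon trade-off is
# what the paper's analysis optimizes.
regret = simulate_threshold_policy(10_000, 0.95, lambda rng: rng.random())
```

The interesting tension is visible even in this toy: a threshold near 1 wastes many rounds sampling mediocre arms, while a low threshold commits early to a suboptimal arm; the paper's contribution is achieving the right trade-off without knowing the tail of the value distribution.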

Cite

Text

David and Shimkin. "Infinitely Many-Armed Bandits with Unknown Value Distribution." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2014. doi:10.1007/978-3-662-44848-9_20

Markdown

[David and Shimkin. "Infinitely Many-Armed Bandits with Unknown Value Distribution." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2014.](https://mlanthology.org/ecmlpkdd/2014/david2014ecmlpkdd-infinitely/) doi:10.1007/978-3-662-44848-9_20

BibTeX

@inproceedings{david2014ecmlpkdd-infinitely,
  title     = {{Infinitely Many-Armed Bandits with Unknown Value Distribution}},
  author    = {David, Yahel and Shimkin, Nahum},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2014},
  pages     = {307--322},
  doi       = {10.1007/978-3-662-44848-9_20},
  url       = {https://mlanthology.org/ecmlpkdd/2014/david2014ecmlpkdd-infinitely/}
}