Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

Abstract

We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over $[0,1]$. Rewards 1 and 0 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first $m$ failures, respectively, where $m$ is a fixed parameter. This two-target algorithm achieves a long-term average regret of $\sqrt{2n}$ for a large parameter $m$ and a known time horizon $n$. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is $2\sqrt{n}$. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.
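As a rough illustration of the exploration rule described in the abstract, the sketch below plays each freshly drawn arm until it either misses a target and is discarded, or passes both targets and is exploited for the remaining horizon. The target values `l1` and `l2` are placeholder assumptions of a plausible order of magnitude, not the paper's exact choices, which are derived to reach the $\sqrt{2n}$ regret.

```python
import math
import random


def two_target_bandit(n, m=3, seed=0):
    """Sketch of a two-target rule for an infinite-armed Bernoulli bandit
    whose mean rewards are i.i.d. Uniform[0, 1].

    The targets l1 and l2 below are illustrative placeholders, not the
    values derived in the paper.
    """
    rng = random.Random(seed)
    l1 = max(1, round(math.sqrt(n / (2 * m))))  # successes required by the 1st failure (assumed)
    l2 = max(1, round(math.sqrt(n * m / 2)))    # successes required by the m-th failure (assumed)

    total_reward = 0
    t = 0
    while t < n:
        mu = rng.random()                       # fresh arm: mean reward ~ Uniform[0, 1]
        successes = failures = 0
        exploit = False
        # Exploration: play the arm until it misses a target or passes both.
        while t < n:
            r = 1 if rng.random() < mu else 0   # Bernoulli(mu) reward
            total_reward += r
            t += 1
            successes += r
            failures += 1 - r
            if failures == 1 and successes < l1:
                break                           # first target missed: discard the arm
            if failures >= m:
                exploit = successes >= l2       # second target reached: keep the arm
                break
        if exploit:
            # Exploitation: play the retained arm for the rest of the horizon.
            while t < n:
                total_reward += 1 if rng.random() < mu else 0
                t += 1
    return n - total_reward                     # regret relative to the best achievable mean, 1
```

For example, `two_target_bandit(10_000, m=3)` simulates one horizon of length 10,000 and returns the realized regret for that run.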

Cite

Text

Bonald and Proutiere. "Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards." Neural Information Processing Systems, 2013.

Markdown

[Bonald and Proutiere. "Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards." Neural Information Processing Systems, 2013.](https://mlanthology.org/neurips/2013/bonald2013neurips-twotarget/)

BibTeX

@inproceedings{bonald2013neurips-twotarget,
  title     = {{Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards}},
  author    = {Bonald, Thomas and Proutiere, Alexandre},
  booktitle = {Neural Information Processing Systems},
  year      = {2013},
  pages     = {2184--2192},
  url       = {https://mlanthology.org/neurips/2013/bonald2013neurips-twotarget/}
}