Finite-Time Regret Bounds for the Multiarmed Bandit Problem

Abstract

We show finite-time regret bounds for the multiarmed bandit problem under the assumption that all rewards come from a bounded and fixed range. Our regret bounds after any number T of pulls are of the form a+b log T +c log 2 T , where a, b, and c are positive constants not depending on T . These bounds are shown to hold for variants of the popular "-greedy and Boltzmann allocation rules, and for a new simple deterministic allocation rule. Moreover, our results also apply to an extension of the basic bandit problem in which reward distributions can depend, to some extent, from previous pulls and observed rewards. Finally, we discuss the empirical performance of our algorithms with respect to specific choices of the reward distributions. 1 INTRODUCTION One of the fundamental issues in reinforcement learning is the exploration versus exploitation dilemma, whose simplest instance is, perhaps, the bandit problem. In its most basic formulation, a bandit problem is a set of N (with N ? 1)...

Cite

Text

Cesa-Bianchi and Fischer. "Finite-Time Regret Bounds for the Multiarmed Bandit Problem." International Conference on Machine Learning, 1998.

Markdown

[Cesa-Bianchi and Fischer. "Finite-Time Regret Bounds for the Multiarmed Bandit Problem." International Conference on Machine Learning, 1998.](https://mlanthology.org/icml/1998/cesabianchi1998icml-finite/)

BibTeX

@inproceedings{cesabianchi1998icml-finite,
  title     = {{Finite-Time Regret Bounds for the Multiarmed Bandit Problem}},
  author    = {Cesa-Bianchi, Nicolò and Fischer, Paul},
  booktitle = {International Conference on Machine Learning},
  year      = {1998},
  pages     = {100-108},
  url       = {https://mlanthology.org/icml/1998/cesabianchi1998icml-finite/}
}