Q-Learning for Bandit Problems

Abstract

Multi-armed bandits may be viewed as decompositionally structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called “Gittins indices.” Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a process in state i may be interpreted as a particular component of the maximum-value function associated with the “restart-in-i” process, a simple MDP to which standard solution methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model; it suggests utilizing process-state-specific Q-learning agents to solve their respective restart-in-state-i subproblems, and includes an example in which the online reinforcement learning approach is applied to a problem of stochastic scheduling, one instance drawn from a wide class of problems that may be formulated as bandit problems.
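To make the restart-in-i construction concrete, below is a minimal Python sketch of the idea the abstract describes: one Q-learning agent per state i maintains values Q_i(j, a) over two actions, “continue” and “restart” (teleport to state i). Every agent updates its “continue” value from each observed transition, and the restart-in-s agent reuses a transition out of state s to update its “restart” values, since restarting lands the process in s regardless of where it was. The toy Markov reward chain (`P`, `R`), the learning rate `alpha`, and the `(1 - gamma)` scaling taken from Katehakis and Veinott's characterization are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha, steps = 3, 0.9, 0.1, 200_000

# Assumed toy bandit process: a Markov chain with state-dependent rewards.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.5, 0.2])  # reward for occupying each state

# Q[i, j, a]: Q-values of the restart-in-i agent at state j;
# a = 0 is "continue", a = 1 is "restart" (teleport to state i).
Q = np.zeros((n_states, n_states, 2))

s = 0
for _ in range(steps):
    r = R[s]                                  # reward observed in state s
    s_next = rng.choice(n_states, p=P[s])     # sampled transition

    target = r + gamma * Q[:, s_next, :].max(axis=1)  # one target per agent
    # Every agent updates its "continue" value at the visited state...
    Q[:, s, 0] += alpha * (target - Q[:, s, 0])
    # ...and the restart-in-s agent reuses this same sample to update its
    # "restart" value at every state, since restarting teleports to s.
    Q[s, :, 1] += alpha * (target[s] - Q[s, :, 1])

    s = s_next

# At convergence, agent i's value at its own state i recovers the Gittins
# index up to the (1 - gamma) scale factor in Katehakis & Veinott's result.
diag = np.arange(n_states)
indices = (1 - gamma) * Q[diag, diag].max(axis=1)
print("estimated Gittins indices:", indices)

Running this, the estimated indices order the states by desirability (state 0 highest), which is what the scheduling example in the paper uses the learned indices for: at each decision point, activate the process whose current state has the largest index.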

Cite

Text

Duff. "Q-Learning for Bandit Problems." International Conference on Machine Learning, 1995. doi:10.1016/B978-1-55860-377-6.50034-7

Markdown

[Duff. "Q-Learning for Bandit Problems." International Conference on Machine Learning, 1995.](https://mlanthology.org/icml/1995/duff1995icml-q/) doi:10.1016/B978-1-55860-377-6.50034-7

BibTeX

@inproceedings{duff1995icml-q,
  title     = {{Q-Learning for Bandit Problems}},
  author    = {Duff, Michael O.},
  booktitle = {International Conference on Machine Learning},
  year      = {1995},
  pages     = {209--217},
  doi       = {10.1016/B978-1-55860-377-6.50034-7},
  url       = {https://mlanthology.org/icml/1995/duff1995icml-q/}
}