Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Abstract

We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²)log(1/δ)) times to find an ε-optimal arm with probability of at least 1-δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
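To make the bandit-side idea concrete, here is a minimal Python sketch of confidence-interval action elimination: sample every surviving arm each round, maintain Hoeffding intervals around the empirical means, and eliminate any arm whose upper bound falls below the best lower bound. The function name, the δ-schedule inside the radius, and the stopping rule are illustrative choices, not the paper's exact constants; note also that this plain elimination scheme gives an O((n/ε²)log(n/δ)) sample bound, whereas the paper's optimal O((n/ε²)log(1/δ)) bound uses its median elimination refinement.

```python
import math
import random


def successive_elimination(arms, epsilon, delta):
    """Find an epsilon-optimal arm with probability at least 1 - delta.

    A sketch, assuming `arms` is a list of callables returning rewards
    in [0, 1]; the delta schedule below is one common union-bound
    choice, not necessarily the constants used in the paper.
    """
    n = len(arms)
    active = list(range(n))
    sums = [0.0] * n
    t = 0  # pulls per surviving arm so far
    while len(active) > 1:
        t += 1
        for i in active:
            sums[i] += arms[i]()
        # Hoeffding radius with a union bound over arms and rounds.
        radius = math.sqrt(math.log(4.0 * n * t * t / delta) / (2.0 * t))
        means = {i: sums[i] / t for i in active}
        best = max(means.values())
        # Keep only arms whose upper confidence bound reaches the best
        # arm's lower confidence bound; the rest are suboptimal w.h.p.
        active = [i for i in active if means[i] + radius >= best - radius]
        # Stopping condition: intervals tight enough that any surviving
        # arm is epsilon-optimal with high probability.
        if 2.0 * radius <= epsilon:
            break
    return max(active, key=lambda i: sums[i] / max(t, 1))


# Usage: three Bernoulli arms with means 0.5, 0.6, and 0.8.
bandit = [lambda p=p: float(random.random() < p) for p in (0.5, 0.6, 0.8)]
print(successive_elimination(bandit, epsilon=0.05, delta=0.05))
```

The same elimination-by-confidence-interval pattern carries over to the paper's reinforcement learning variants, where the intervals are maintained around the Q-function rather than around arm means.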

Cite

Text

Even-Dar et al. "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems." Journal of Machine Learning Research, 2006.

Markdown

[Even-Dar et al. "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems." Journal of Machine Learning Research, 2006.](https://mlanthology.org/jmlr/2006/evendar2006jmlr-action/)

BibTeX

@article{evendar2006jmlr-action,
  title     = {{Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems}},
  author    = {Even-Dar, Eyal and Mannor, Shie and Mansour, Yishay},
  journal   = {Journal of Machine Learning Research},
  year      = {2006},
  pages     = {1079--1105},
  volume    = {7},
  url       = {https://mlanthology.org/jmlr/2006/evendar2006jmlr-action/}
}