PAC Bounds for Multi-Armed Bandit and Markov Decision Processes

Abstract

The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that given n arms, it suffices to pull the arms O ( n/ε ^2log 1/δ ) times to find an ∈-optimal arm with probability of at least 1 - δ . This is in contrast to the naive bound of O ( n/ε ^2log n/δ ). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst case setting. We also provide a matching lower bound. We show how given an algorithm for the PAC model Multi-armed Bandit problem, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration, and in each iteration invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem we improve the dependence on the number of actions.

Cite

Text

Even-Dar et al. "PAC Bounds for Multi-Armed Bandit and Markov Decision Processes." Annual Conference on Computational Learning Theory, 2002. doi:10.1007/3-540-45435-7_18

Markdown

[Even-Dar et al. "PAC Bounds for Multi-Armed Bandit and Markov Decision Processes." Annual Conference on Computational Learning Theory, 2002.](https://mlanthology.org/colt/2002/evendar2002colt-pac/) doi:10.1007/3-540-45435-7_18

BibTeX

@inproceedings{evendar2002colt-pac,
  title     = {{PAC Bounds for Multi-Armed Bandit and Markov Decision Processes}},
  author    = {Even-Dar, Eyal and Mannor, Shie and Mansour, Yishay},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2002},
  pages     = {255-270},
  doi       = {10.1007/3-540-45435-7_18},
  url       = {https://mlanthology.org/colt/2002/evendar2002colt-pac/}
}