Reinforcement Learning in Finite MDPs: PAC Analysis

Abstract

We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. Algorithms that achieve this are called "PAC-MDP" algorithms; they include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis of upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
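
The contrast between model-based R-MAX and model-free Delayed Q-learning hinges on how each algorithm drives exploration. As a rough illustration only (not the paper's pseudocode), the sketch below implements the standard R-MAX idea: a state-action pair observed fewer than m times is treated as maximally rewarding, so planning on the resulting optimistic model steers the agent toward under-explored pairs. The environment interface (env.reset(), env.step(a) returning (next_state, reward)) and all parameter values are assumptions made for this example.

import numpy as np


def rmax_episode(env, n_states, n_actions, m=5, r_max=1.0, gamma=0.95,
                 horizon=100, vi_iters=200):
    """Run one trajectory of a simple R-MAX-style agent and return its return."""
    counts = np.zeros((n_states, n_actions))           # visit counts n(s, a)
    trans = np.zeros((n_states, n_actions, n_states))  # transition counts n(s, a, s')
    rew_sum = np.zeros((n_states, n_actions))          # accumulated rewards

    def plan():
        """Value iteration on the optimistic empirical model."""
        q = np.zeros((n_states, n_actions))
        for _ in range(vi_iters):
            v = q.max(axis=1)
            for s in range(n_states):
                for a in range(n_actions):
                    if counts[s, a] >= m:               # "known" pair: empirical model
                        p = trans[s, a] / counts[s, a]
                        r = rew_sum[s, a] / counts[s, a]
                        q[s, a] = r + gamma * p @ v
                    else:                               # unknown pair: optimistic value
                        q[s, a] = r_max / (1.0 - gamma)
        return q

    state, total = env.reset(), 0.0
    for _ in range(horizon):
        action = int(plan()[state].argmax())            # act greedily w.r.t. optimistic Q
        next_state, reward = env.step(action)
        counts[state, action] += 1
        trans[state, action, next_state] += 1
        rew_sum[state, action] += reward
        total += reward
        state = next_state
    return total

Replanning with full value iteration after every step is wasteful; in practice the optimistic model only changes when a pair crosses the "known" threshold m, so the plan can be cached between such events.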

Cite

Text

Strehl et al. "Reinforcement Learning in Finite MDPs: PAC Analysis." Journal of Machine Learning Research, 2009.

Markdown

[Strehl et al. "Reinforcement Learning in Finite MDPs: PAC Analysis." Journal of Machine Learning Research, 2009.](https://mlanthology.org/jmlr/2009/strehl2009jmlr-reinforcement/)

BibTeX

@article{strehl2009jmlr-reinforcement,
  title     = {{Reinforcement Learning in Finite MDPs: PAC Analysis}},
  author    = {Strehl, Alexander L. and Li, Lihong and Littman, Michael L.},
  journal   = {Journal of Machine Learning Research},
  year      = {2009},
  pages     = {2413--2444},
  volume    = {10},
  url       = {https://mlanthology.org/jmlr/2009/strehl2009jmlr-reinforcement/}
}