Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

Abstract

Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example, tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound of order O(|S|²|A|H² log(1/δ)/ε²) and a lower PAC bound of order Ω(|S||A|H² log(1/(δ+c))/ε²) (both ignoring log-terms) that match up to log-terms and an additional linear dependency on the number of states |S|. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs, which have a time-horizon dependency of at least H³.
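For readability, the two bounds stated in the abstract can be written in display form as follows. This is only a restatement of the abstract's expressions; the symbol n below (number of episodes needed for the PAC guarantee) is introduced here for illustration, and constants and lower-order log factors are suppressed as in the abstract.

n_upper = O\!\left(\frac{|S|^2\,|A|\,H^2}{\varepsilon^2}\,\log\frac{1}{\delta}\right),
\qquad
n_lower = \Omega\!\left(\frac{|S|\,|A|\,H^2}{\varepsilon^2}\,\log\frac{1}{\delta + c}\right)

Up to log-terms, the two expressions differ only by a single factor of |S|, which is the gap between the upper and lower bounds noted in the abstract.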

Cite

Text

Dann and Brunskill. "Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning." Neural Information Processing Systems, 2015.

Markdown

[Dann and Brunskill. "Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning." Neural Information Processing Systems, 2015.](https://mlanthology.org/neurips/2015/dann2015neurips-sample/)

BibTeX

@inproceedings{dann2015neurips-sample,
  title     = {{Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning}},
  author    = {Dann, Christoph and Brunskill, Emma},
  booktitle = {Neural Information Processing Systems},
  year      = {2015},
  pages     = {2818--2826},
  url       = {https://mlanthology.org/neurips/2015/dann2015neurips-sample/}
}