Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

Abstract

Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces Uniform-PAC, a new framework for theoretically measuring the performance of such algorithms that strengthens the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version can be used to derive high-probability regret guarantees, and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees up to a factor of the horizon.
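
To make the framework concrete, here is a rough paraphrase of the Uniform-PAC condition; the notation ($\Delta_k$, $F_{\mathrm{UPAC}}$) is ours and the precise form should be checked against the paper. Writing $\Delta_k = V^*_1(s_{k,1}) - V^{\pi_k}_1(s_{k,1})$ for the suboptimality of the policy $\pi_k$ followed in episode $k$, an algorithm is Uniform-PAC for confidence $\delta$ if

% Sketch of the Uniform-PAC condition (our paraphrase, notation assumed):
% the number of epsilon-suboptimal episodes is bounded for all epsilon
% simultaneously, on a single high-probability event.
\[
  \Pr\Big( \forall \varepsilon > 0:\;
    \sum_{k=1}^{\infty} \mathbf{1}\{\Delta_k > \varepsilon\}
    \;\le\; F_{\mathrm{UPAC}}(\varepsilon, \delta)
  \Big) \;\ge\; 1 - \delta,
\]

where $F_{\mathrm{UPAC}}$ is polynomial in $1/\varepsilon$ and $\log(1/\delta)$. The key difference from a classical PAC guarantee is that $\varepsilon$ is quantified uniformly inside one high-probability event rather than fixed in advance, which is what allows the same statement to be converted into a high-probability regret bound.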

Cite

Text

Dann et al. "Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning." Neural Information Processing Systems, 2017.

Markdown

[Dann et al. "Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning." Neural Information Processing Systems, 2017.](https://mlanthology.org/neurips/2017/dann2017neurips-unifying/)

BibTeX

@inproceedings{dann2017neurips-unifying,
  title     = {{Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning}},
  author    = {Dann, Christoph and Lattimore, Tor and Brunskill, Emma},
  booktitle = {Neural Information Processing Systems},
  year      = {2017},
  pages     = {5713--5723},
  url       = {https://mlanthology.org/neurips/2017/dann2017neurips-unifying/}
}