Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

Abstract

The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible—there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity—yielding a complexity which scales with the suboptimality gaps and the “reachability” of a state. We show that our algorithm is nearly minimax optimal, and demonstrate on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
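For context, the “simple reduction” referenced in the abstract is typically the standard online-to-batch conversion; the following is an illustrative sketch of that argument (not the paper's formal statement), assuming a regret bound of the form $\mathrm{Regret}(K) \le C\sqrt{K}$:

% Sketch of the standard regret-to-PAC reduction (illustrative assumption, not the paper's theorem).
% Run a low-regret algorithm for K episodes, producing policies \pi_1, \dots, \pi_K,
% then output \widehat{\pi} drawn uniformly at random from \{\pi_1, \dots, \pi_K\}.
\[
\mathbb{E}\big[ V^{\star} - V^{\widehat{\pi}} \big]
  = \frac{1}{K} \sum_{k=1}^{K} \big( V^{\star} - V^{\pi_k} \big)
  = \frac{\mathrm{Regret}(K)}{K} .
\]
% If \mathrm{Regret}(K) \le C \sqrt{K}, then choosing K \gtrsim C^2/\epsilon^2 episodes makes the
% right-hand side at most \epsilon, so \widehat{\pi} is \epsilon-optimal in expectation.
% This recovers the worst-case optimal PAC rate, but, as the paper shows, low-regret algorithms
% cannot in general achieve the instance-optimal rate for policy identification.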

Cite

Text

Wagenmaker et al. "Beyond No Regret: Instance-Dependent PAC Reinforcement Learning." Conference on Learning Theory, 2022.

Markdown

[Wagenmaker et al. "Beyond No Regret: Instance-Dependent PAC Reinforcement Learning." Conference on Learning Theory, 2022.](https://mlanthology.org/colt/2022/wagenmaker2022colt-beyond/)

BibTeX

@inproceedings{wagenmaker2022colt-beyond,
  title     = {{Beyond No Regret: Instance-Dependent PAC Reinforcement Learning}},
  author    = {Wagenmaker, Andrew J and Simchowitz, Max and Jamieson, Kevin},
  booktitle = {Conference on Learning Theory},
  year      = {2022},
  pages     = {358--418},
  volume    = {178},
  url       = {https://mlanthology.org/colt/2022/wagenmaker2022colt-beyond/}
}