Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

Abstract

In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the \emph{non-stationary case} in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of $\Omega((H^3SA/\varepsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of “hard MDPs” that differs from the ones previously used in the literature. Using the same class of MDPs, we also provide a rigorous proof of the $\Omega(\sqrt{H^3SAT})$ regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
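For reference, the two bounds stated in the abstract can be written in display form, assuming the standard reading of the symbols ($H$ the horizon, $S$ the number of states, $A$ the number of actions, $T$ the total number of interaction steps, which the abstract itself does not spell out):
$$
n \;=\; \Omega\!\left(\frac{H^3 S A}{\varepsilon^2}\,\log\frac{1}{\delta}\right),
\qquad
\mathrm{Regret}(T) \;=\; \Omega\!\left(\sqrt{H^3 S A T}\right).
$$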

Cite

Text

Domingues et al. "Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited." Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021.

Markdown

[Domingues et al. "Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited." Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021.](https://mlanthology.org/alt/2021/domingues2021alt-episodic/)

BibTeX

@inproceedings{domingues2021alt-episodic,
  title     = {{Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited}},
  author    = {Domingues, Omar Darwiche and Ménard, Pierre and Kaufmann, Emilie and Valko, Michal},
  booktitle = {Proceedings of the 32nd International Conference on Algorithmic Learning Theory},
  year      = {2021},
  pages     = {578-598},
  volume    = {132},
  url       = {https://mlanthology.org/alt/2021/domingues2021alt-episodic/}
}