RL for Latent MDPs: Regret Guarantees and a Lower Bound

Abstract

In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least $\Omega((SA)^M)$ episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires only a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., a sublinear regret guarantee when we are given a good initialization. Finally, under standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., \cite{boots2011online}) and a reachability assumption, we show that the need for initialization can be removed.
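As a rough illustration of the interaction protocol described in the abstract, the following minimal Python sketch samples one of $M$ MDPs at the start of each episode and hides its identity from the agent. The tabular setup and all names (`P`, `R`, `w`, `run_episode`) are our own illustrative assumptions, not the paper's construction or algorithm.

```python
import numpy as np

# Minimal sketch of an LMDP interaction (illustrative assumptions, not the
# paper's construction): M tabular MDPs share state/action spaces of sizes
# S and A and a horizon H; one MDP is drawn at the start of each episode and
# its identity is never revealed to the agent.

rng = np.random.default_rng(0)
M, S, A, H = 3, 5, 2, 10

P = rng.dirichlet(np.ones(S), size=(M, S, A))  # P[m, s, a] = next-state distribution
R = rng.uniform(size=(M, S, A))                # R[m, s, a] = mean reward
w = np.ones(M) / M                             # mixing weights over the M MDPs

def run_episode(policy):
    """Roll out one episode; the latent index m stays hidden from the agent."""
    m = rng.choice(M, p=w)          # environment's private draw of the MDP
    s, total_reward = 0, 0.0
    for t in range(H):
        a = policy(s, t)            # agent observes only states, actions, rewards
        total_reward += R[m, s, a]
        s = rng.choice(S, p=P[m, s, a])
    return total_reward

# Example: evaluate a uniformly random policy over 100 episodes.
random_policy = lambda s, t: rng.integers(A)
print(np.mean([run_episode(random_policy) for _ in range(100)]))
```

Because the latent index is redrawn every episode and never observed, the agent can only condition on within-episode history, which is the source of the $\Omega((SA)^M)$ hardness discussed in the abstract.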

Cite

Text

Kwon et al. "RL for Latent MDPs: Regret Guarantees and a Lower Bound." Neural Information Processing Systems, 2021.

Markdown

[Kwon et al. "RL for Latent MDPs: Regret Guarantees and a Lower Bound." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/kwon2021neurips-rl/)

BibTeX

@inproceedings{kwon2021neurips-rl,
  title     = {{RL for Latent MDPs: Regret Guarantees and a Lower Bound}},
  author    = {Kwon, Jeongyeol and Efroni, Yonathan and Caramanis, Constantine and Mannor, Shie},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/kwon2021neurips-rl/}
}