Model-Based Reinforcement Learning with a Generative Model Is Minimax Optimal

Abstract

This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. We are interested in a basic and unresolved question in model-based planning: is the naïve “plug-in” approach — where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP — non-asymptotically, minimax optimal? Our main result answers this question positively. With regard to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results, we show that using \emph{any} high-accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel “absorbing MDP” construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.
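The plug-in approach described above can be illustrated concretely. The following is a minimal sketch, not the paper's implementation: it assumes a hypothetical two-state, two-action MDP with illustrative transition probabilities and rewards, queries a generative-model oracle a fixed number of times per state-action pair to form the maximum likelihood estimate of the transition model, and then hands the empirical MDP to a black-box planner (here, value iteration).

```python
import random

# Hypothetical toy MDP (illustrative numbers, not from the paper):
# P_true[(s, a)] = list of next-state probabilities; R[(s, a)] = reward.
P_true = {
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 1.0}
GAMMA, N_STATES, N_ACTIONS = 0.9, 2, 2

def generative_model(s, a, rng):
    """Sampling oracle: returns one next state drawn from P(. | s, a)."""
    return 0 if rng.random() < P_true[(s, a)][0] else 1

def empirical_model(n_samples, rng):
    """Maximum likelihood estimate of the transition model,
    built from n_samples oracle calls per state-action pair."""
    P_hat = {}
    for sa in P_true:
        counts = [0] * N_STATES
        for _ in range(n_samples):
            counts[generative_model(*sa, rng)] += 1
        P_hat[sa] = [c / n_samples for c in counts]
    return P_hat

def value_iteration(P, iters=500):
    """Black-box planning oracle: value iteration on the given MDP."""
    V = [0.0] * N_STATES
    for _ in range(iters):
        V = [max(R[(s, a)] + GAMMA * sum(p * V[s2]
                                         for s2, p in enumerate(P[(s, a)]))
                 for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return V

rng = random.Random(0)
P_hat = empirical_model(n_samples=2000, rng=rng)
V_hat = value_iteration(P_hat)    # plan in the empirical MDP
V_star = value_iteration(P_true)  # plan in the true MDP, for comparison
```

With enough samples per state-action pair, the optimal value function of the empirical MDP (`V_hat`) is close to that of the true MDP (`V_star`); the paper's contribution is a sharp, minimax-optimal bound on exactly how many samples this requires.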

Cite

Text

Agarwal et al. "Model-Based Reinforcement Learning with a Generative Model Is Minimax Optimal." Conference on Learning Theory, 2020.

Markdown

[Agarwal et al. "Model-Based Reinforcement Learning with a Generative Model Is Minimax Optimal." Conference on Learning Theory, 2020.](https://mlanthology.org/colt/2020/agarwal2020colt-modelbased/)

BibTeX

@inproceedings{agarwal2020colt-modelbased,
  title     = {{Model-Based Reinforcement Learning with a Generative Model Is Minimax Optimal}},
  author    = {Agarwal, Alekh and Kakade, Sham and Yang, Lin F.},
  booktitle = {Conference on Learning Theory},
  year      = {2020},
  pages     = {67--83},
  volume    = {125},
  url       = {https://mlanthology.org/colt/2020/agarwal2020colt-modelbased/}
}