Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Abstract

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies, i.e., acting by 1-step planning, can achieve tight minimax performance in terms of regret, $\tilde{O}(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, which is then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
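
To make the computational contrast concrete, the sketch below (not the authors' algorithm, which additionally relies on optimistic bonuses and careful confidence sets) compares full-planning, i.e., backward induction over the whole empirical MDP at a cost of roughly O(H S^2 A) per episode, with a greedy 1-step Bellman backup at the visited state only, costing O(S A) per step. All names here (`P_hat`, `R_hat`, `full_planning`, `one_step_greedy_update`) are hypothetical and chosen for illustration.

```python
import numpy as np

# Illustrative sketch only: empirical model of a finite-horizon MDP.
# P_hat[h] has shape (S, A, S); R_hat[h] has shape (S, A).
S, A, H = 5, 3, 4
rng = np.random.default_rng(0)
P_hat = rng.dirichlet(np.ones(S), size=(H, S, A))  # empirical transition model
R_hat = rng.uniform(size=(H, S, A))                # empirical rewards

def full_planning(P_hat, R_hat):
    """Backward induction over the entire empirical MDP: O(H * S^2 * A) work."""
    V = np.zeros((H + 1, S))            # V[H] = 0 is the terminal value
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = R_hat[h] + P_hat[h] @ V[h + 1]   # Bellman backup for every state
        V[h] = Q[h].max(axis=1)
    return Q

def one_step_greedy_update(V, s, h, P_hat, R_hat):
    """Single Bellman backup at the visited state only: O(S * A) work per step."""
    q_sa = R_hat[h, s] + P_hat[h, s] @ V[h + 1]  # (A,) vector of action values
    V[h, s] = q_sa.max()                          # real-time-DP-style value update
    return int(q_sa.argmax())                     # greedy action w.r.t. current V

# Usage: act greedily while updating V online, instead of re-solving the model.
V = np.zeros((H + 1, S))
s = 0
for h in range(H):
    a = one_step_greedy_update(V, s, h, P_hat, R_hat)
    s = rng.choice(S, p=P_hat[h, s, a])           # simulate one transition
```

Per episode, the greedy variant touches only the H visited state-action pairs, which is where the factor-S saving over `full_planning` comes from.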

Cite

Text

Efroni et al. "Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies." Neural Information Processing Systems, 2019.

Markdown

[Efroni et al. "Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/efroni2019neurips-tight/)

BibTeX

@inproceedings{efroni2019neurips-tight,
  title     = {{Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies}},
  author    = {Efroni, Yonathan and Merlis, Nadav and Ghavamzadeh, Mohammad and Mannor, Shie},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {12224--12234},
  url       = {https://mlanthology.org/neurips/2019/efroni2019neurips-tight/}
}