Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-in Time

Abstract

A crucial problem in reinforcement learning is learning the optimal policy. We study this problem in tabular infinite-horizon discounted Markov decision processes (MDPs) under the online setting. Existing algorithms either fail to achieve regret optimality or incur high memory and computational costs. Moreover, existing optimal algorithms all require a long burn-in time to attain optimal sample efficiency, i.e., their optimality is not guaranteed unless the sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction together with a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a short burn-in time.
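To give a sense of what "variance reduction" means in a model-free, tabular context, the sketch below shows a generic reference-advantage style Q-learning update. This is only an illustrative assumption, not the algorithm from the paper: the function names, step-size schedule, and update rule here are hypothetical, and the paper's actual method (including its adaptive policy-switching scheme) differs in detail.

```python
import numpy as np

# Illustrative sketch (NOT the paper's algorithm): variance-reduced tabular
# Q-learning in the reference-advantage style. The Bellman target is split
# into a "reference" part, averaged over many visits and hence low-variance,
# plus a small "advantage" correction estimated online.

def make_vr_q_learner(num_states, num_actions, gamma=0.99):
    Q = np.zeros((num_states, num_actions))        # running Q-value estimate
    Q_ref = np.zeros((num_states, num_actions))    # slowly-updated reference Q
    ref_sum = np.zeros((num_states, num_actions))  # running sum of V_ref(s')
    adv_sum = np.zeros((num_states, num_actions))  # running sum of V(s') - V_ref(s')
    counts = np.zeros((num_states, num_actions))   # visitation counts

    def update(s, a, r, s_next):
        counts[s, a] += 1
        n = counts[s, a]
        v_next = Q[s_next].max()
        v_ref_next = Q_ref[s_next].max()

        # Reference term: averaged over all visits, so its noise shrinks with n.
        ref_sum[s, a] += v_ref_next
        # Advantage correction: small in magnitude, hence small variance.
        adv_sum[s, a] += v_next - v_ref_next

        target = r + gamma * (ref_sum[s, a] / n + adv_sum[s, a] / n)
        lr = 1.0 / n  # illustrative step size; a careful schedule is needed for optimal regret
        Q[s, a] = (1 - lr) * Q[s, a] + lr * target
        return Q[s, a]

    return Q, update
```

The key design idea this sketch tries to convey is that the reference value changes rarely, so most of the stochasticity in the Bellman target is absorbed by a term that can be averaged over many samples, leaving only a low-variance correction to be learned online.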

Cite

Text

Ji and Li. "Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-in Time." Neural Information Processing Systems, 2023.

Markdown

[Ji and Li. "Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-in Time." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/ji2023neurips-regretoptimal/)

BibTeX

@inproceedings{ji2023neurips-regretoptimal,
  title     = {{Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-in Time}},
  author    = {Ji, Xiang and Li, Gen},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/ji2023neurips-regretoptimal/}
}