The Best of Both Worlds: Stochastic and Adversarial Episodic MDPs with Unknown Transition

Abstract

We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\log T)$ regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques. Specifically, we first propose a loss-shifting trick in the FTRL analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by the regret itself in the stochastic setting, a key property to ensure $\mathcal{O}(\log T)$ regret.
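To make the loss-shifting idea concrete, here is a minimal, illustrative sketch, not the paper's algorithm: for FTRL with a negative-entropy regularizer over the probability simplex, adding the same constant to every loss coordinate leaves the iterate unchanged, since $\langle q, c\mathbf{1}\rangle = c$ for every $q$ in the simplex. The paper's trick generalizes this invariance to occupancy measures of an episodic MDP, where the shift is built from value functions rather than a constant; all names below are our own.

```python
# Minimal sketch of the loss-shifting invariance behind FTRL (illustrative only):
# over the simplex, shifting every loss by a constant c does not change the FTRL
# iterate, because <q, c*1> = c for all q in the simplex. The paper exploits an
# analogous invariance for occupancy measures with a value-function-based shift.
import numpy as np

def ftrl_entropy(cum_loss, lr):
    """FTRL with negative-entropy regularizer over the simplex:
    argmin_q <q, cum_loss> + (1/lr) * sum_i q_i log q_i,
    which has the closed-form softmax solution."""
    z = -lr * cum_loss
    z -= z.max()                       # numerical stability
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
cum_loss = rng.random(5)               # cumulative estimated losses
shift = 3.7                            # any constant shift

q1 = ftrl_entropy(cum_loss, lr=0.5)
q2 = ftrl_entropy(cum_loss + shift, lr=0.5)
print(np.allclose(q1, q2))             # True: the FTRL iterate is shift-invariant
```

Although the decisions are unchanged, a well-chosen shift can shrink the stability term in the FTRL regret analysis, which is what enables the simplified proof and the improved known-transition bounds described above.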

Cite

Text

Jin et al. "The Best of Both Worlds: Stochastic and Adversarial Episodic MDPs with Unknown Transition." Neural Information Processing Systems, 2021.

Markdown

[Jin et al. "The Best of Both Worlds: Stochastic and Adversarial Episodic MDPs with Unknown Transition." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/jin2021neurips-best/)

BibTeX

@inproceedings{jin2021neurips-best,
  title     = {{The Best of Both Worlds: Stochastic and Adversarial Episodic MDPs with Unknown Transition}},
  author    = {Jin, Tiancheng and Huang, Longbo and Luo, Haipeng},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/jin2021neurips-best/}
}