Nearly Optimal Policy Optimization with Stable at Any Time Guarantee
Abstract
Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, the theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result for policy-based methods, due to Shani et al. (2020), is only $\tilde{O}(\sqrt{S^2AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes; this leaves a $\sqrt{SH}$ gap to the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$ (Jin et al., 2018). To bridge this gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal up to logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
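A brief sketch of why the $S > H$ condition yields minimax optimality (this derivation is our own illustration of the comparison stated in the abstract, not an excerpt from the paper):

\begin{align*}
S \ge H \;\Longrightarrow\; \sqrt{AH^4K} = \sqrt{H \cdot AH^3K} &\le \sqrt{S \cdot AH^3K} = \sqrt{SAH^3K}, \\
\text{so}\quad \tilde{O}\big(\sqrt{SAH^3K} + \sqrt{AH^4K}\big) &= \tilde{O}\big(\sqrt{SAH^3K}\big),
\end{align*}

which matches the lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$ of Jin et al. (2018) up to logarithmic factors.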
Cite
Text
Wu et al. "Nearly Optimal Policy Optimization with Stable at Any Time Guarantee." International Conference on Machine Learning, 2022.
Markdown
[Wu et al. "Nearly Optimal Policy Optimization with Stable at Any Time Guarantee." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/wu2022icml-nearly/)
BibTeX
@inproceedings{wu2022icml-nearly,
  title     = {{Nearly Optimal Policy Optimization with Stable at Any Time Guarantee}},
  author    = {Wu, Tianhao and Yang, Yunchang and Zhong, Han and Wang, Liwei and Du, Simon and Jiao, Jiantao},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {24243--24265},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/wu2022icml-nearly/}
}