Is Reinforcement Learning More Difficult than Bandits? a Near-Optimal Algorithm Escaping the Curse of Horizon
Abstract
Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity. We consider the episodic reinforcement learning with S states, A actions, planning horizon H, total reward bounded by 1, and the agent plays for K episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. The new bonus only requires tweaking the constants to ensure optimism and thus is significantly simpler than existing bonus constructions. We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \poly\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of contextual bandits. Notably, this result 1) exponentially improves the state-of-the-art polynomial-time algorithms by Dann et al. [2019], Zanette et al. [2019], and Zhang et al. [2020] in terms of the dependency on H, and 2) exponentially improves the running time in [Wang et al. 2020] and significantly improves the dependency on S, A and K in sample complexity.
Cite
Text
Zhang et al. "Is Reinforcement Learning More Difficult than Bandits? a Near-Optimal Algorithm Escaping the Curse of Horizon." Conference on Learning Theory, 2021.Markdown
[Zhang et al. "Is Reinforcement Learning More Difficult than Bandits? a Near-Optimal Algorithm Escaping the Curse of Horizon." Conference on Learning Theory, 2021.](https://mlanthology.org/colt/2021/zhang2021colt-reinforcement/)BibTeX
@inproceedings{zhang2021colt-reinforcement,
title = {{Is Reinforcement Learning More Difficult than Bandits? a Near-Optimal Algorithm Escaping the Curse of Horizon}},
author = {Zhang, Zihan and Ji, Xiangyang and Du, Simon},
booktitle = {Conference on Learning Theory},
year = {2021},
pages = {4528-4531},
volume = {134},
url = {https://mlanthology.org/colt/2021/zhang2021colt-reinforcement/}
}