BooVI: Provably Efficient Bootstrapped Value Iteration
Abstract
Despite the tremendous success of reinforcement learning (RL) with function approximation, efficient exploration remains a significant challenge, both practically and theoretically. In particular, existing theoretically grounded RL algorithms based on upper confidence bounds (UCBs), such as optimistic least-squares value iteration (LSVI), are often incompatible with practically powerful function approximators, such as neural networks. In this paper, we develop a variant of \underline{boo}tstrapped LS\underline{VI}, namely BooVI, which bridges such a gap between practice and theory. Practically, BooVI drives exploration through (re)sampling, making it compatible with general function approximators. Theoretically, BooVI inherits the worst-case $\tilde{O}(\sqrt{d^3 H^3 T})$-regret of optimistic LSVI in the episodic linear setting. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps.
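The abstract's key mechanism, driving exploration through (re)sampling instead of an explicit UCB bonus, can be illustrated with a minimal sketch. Assuming the episodic linear setting where Q-values are approximated as $\phi(s,a)^\top w$, the hypothetical `boot_lsvi_step` below performs one ridge-regression update on Gaussian-perturbed targets; the noise form, function name, and parameters are illustrative assumptions, not the paper's exact resampling scheme.

```python
import numpy as np

def boot_lsvi_step(Phi, rewards, next_values, lam=1.0, noise_scale=1.0,
                   rng=np.random.default_rng(0)):
    """One bootstrapped least-squares value-iteration update (sketch).

    Phi         : (n, d) feature matrix of visited state-action pairs
    rewards     : (n,) observed rewards
    next_values : (n,) estimated values of successor states

    Rather than adding an optimism bonus as in optimistic LSVI,
    exploration here comes from perturbing the regression targets
    with Gaussian noise (one simple stand-in for bootstrapping).
    """
    n, d = Phi.shape
    targets = rewards + next_values + noise_scale * rng.standard_normal(n)
    A = Phi.T @ Phi + lam * np.eye(d)        # ridge-regularized Gram matrix
    w = np.linalg.solve(A, Phi.T @ targets)  # perturbed least-squares weights
    return w                                 # Q_hat(s, a) ≈ phi(s, a) @ w
```

Redrawing the noise each episode yields randomized value estimates, so the agent occasionally overestimates under-explored actions, which is the role a UCB bonus plays in optimistic LSVI.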
Cite
Text
Liu et al. "BooVI: Provably Efficient Bootstrapped Value Iteration." Neural Information Processing Systems, 2021.
Markdown
[Liu et al. "BooVI: Provably Efficient Bootstrapped Value Iteration." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/liu2021neurips-boovi/)
BibTeX
@inproceedings{liu2021neurips-boovi,
  title = {{BooVI: Provably Efficient Bootstrapped Value Iteration}},
  author = {Liu, Boyi and Cai, Qi and Yang, Zhuoran and Wang, Zhaoran},
  booktitle = {Neural Information Processing Systems},
  year = {2021},
  url = {https://mlanthology.org/neurips/2021/liu2021neurips-boovi/}
}