Online Sparse Reinforcement Learning
Abstract
We investigate the hardness of online reinforcement learning in sparse linear Markov decision processes (MDPs), with a special focus on the high-dimensional regime where the ambient dimension is larger than the number of episodes. Our contribution is two-fold. First, we provide a lower bound showing that linear regret is generally unavoidable in this setting, even if there exists a policy that collects well-conditioned data. Second, we show that if the learner has oracle access to a policy that collects well-conditioned data, then a variant of Lasso fitted Q-iteration enjoys a regret of $O(N^{2/3})$, where $N$ is the number of episodes.
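To make the algorithmic component of the abstract concrete, below is a minimal sketch of Lasso fitted Q-iteration on a fixed batch of transitions. It assumes a finite action set, a known feature map `phi`, and a finite horizon; the names `lasso_fqi`, `batch`, `phi`, `actions`, `horizon`, and `lam` are hypothetical, and the paper's actual variant differs in how the exploratory data are collected and how the regularization level is tuned.

```python
# Hypothetical sketch of Lasso fitted Q-iteration (not the paper's exact algorithm).
import numpy as np
from sklearn.linear_model import Lasso

def lasso_fqi(batch, phi, actions, horizon, lam=0.1):
    """batch: list of (state, action, reward, next_state) transitions.
    phi(s, a): feature vector in R^d; actions: finite action set."""
    d = phi(batch[0][0], batch[0][1]).shape[0]
    weights = [np.zeros(d) for _ in range(horizon + 1)]  # Q_H is identically zero
    # Backward induction: fit Q_h from Q_{h+1} via a sparse (Lasso) regression.
    for h in reversed(range(horizon)):
        X = np.array([phi(s, a) for (s, a, r, s_next) in batch])
        # Regression target: immediate reward plus greedy value at the next state.
        y = np.array([
            r + max(phi(s_next, b) @ weights[h + 1] for b in actions)
            for (s, a, r, s_next) in batch
        ])
        reg = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
        weights[h] = reg.coef_
    # Greedy policy at stage h: argmax_a phi(s, a) @ weights[h]
    return weights
```

The Lasso penalty is what exploits sparsity: with well-conditioned data, the fitted weight vectors concentrate on the few relevant coordinates even when the ambient dimension exceeds the number of episodes.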
Cite
Text
Hao et al. "Online Sparse Reinforcement Learning." Artificial Intelligence and Statistics, 2021.
Markdown
[Hao et al. "Online Sparse Reinforcement Learning." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/hao2021aistats-online/)
BibTeX
@inproceedings{hao2021aistats-online,
title = {{Online Sparse Reinforcement Learning}},
author = {Hao, Botao and Lattimore, Tor and Szepesvari, Csaba and Wang, Mengdi},
booktitle = {Artificial Intelligence and Statistics},
year = {2021},
pages = {316--324},
volume = {130},
url = {https://mlanthology.org/aistats/2021/hao2021aistats-online/}
}