Online Sparse Reinforcement Learning

Abstract

We investigate the hardness of online reinforcement learning in sparse linear Markov decision processes (MDPs), with a special focus on the high-dimensional regime where the ambient dimension is larger than the number of episodes. Our contribution is two-fold. First, we provide a lower bound showing that linear regret is generally unavoidable in this regime, even if there exists a policy that collects well-conditioned data. Second, we show that if the learner has oracle access to a policy that collects well-conditioned data, then a variant of Lasso fitted Q-iteration enjoys a regret of $O(N^{2/3})$, where $N$ is the number of episodes.
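
The abstract refers to Lasso fitted Q-iteration, which fits each stage's Q-function by an $\ell_1$-regularized (Lasso) regression onto Bellman targets, working backwards from the end of the episode. The snippet below is a minimal sketch of that idea, not the authors' exact algorithm: the feature map `phi`, the data layout `transitions`, the finite action set `actions`, and the regularization strength `lam` are all illustrative assumptions, and scikit-learn's `Lasso` stands in for the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import Lasso


def lasso_fitted_q_iteration(transitions, phi, actions, horizon, lam=0.1):
    """One batch of Lasso fitted Q-iteration (illustrative sketch).

    transitions: list of (state, action, reward, next_state) tuples
    phi(s, a):   d-dimensional feature map of the (assumed) sparse linear MDP
    actions:     finite action set
    horizon:     episode length H
    lam:         Lasso regularization strength (an illustrative choice)
    """
    d = phi(*transitions[0][:2]).shape[0]
    weights = [np.zeros(d) for _ in range(horizon + 1)]  # terminal value is zero

    # Backward induction: at each stage, regress Bellman targets on features
    # with an l1 penalty so the fitted Q-function stays sparse.
    for h in reversed(range(horizon)):
        X = np.array([phi(s, a) for (s, a, _, _) in transitions])
        y = np.array([
            r + max(phi(s_next, b) @ weights[h + 1] for b in actions)
            for (_, _, r, s_next) in transitions
        ])
        reg = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
        weights[h] = reg.coef_

    # Greedy policy at stage h: pi_h(s) = argmax_a phi(s, a) @ weights[h]
    return weights
```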

Cite

Text

Hao et al. "Online Sparse Reinforcement Learning." Artificial Intelligence and Statistics, 2021.

Markdown

[Hao et al. "Online Sparse Reinforcement Learning." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/hao2021aistats-online/)

BibTeX

@inproceedings{hao2021aistats-online,
  title     = {{Online Sparse Reinforcement Learning}},
  author    = {Hao, Botao and Lattimore, Tor and Szepesvari, Csaba and Wang, Mengdi},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2021},
  pages     = {316-324},
  volume    = {130},
  url       = {https://mlanthology.org/aistats/2021/hao2021aistats-online/}
}