Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs

Abstract

We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Given access to a generative model, we achieve rate-optimal sample complexity by performing a simple, *perturbed* version of least-squares value iteration with orthogonal trigonometric polynomials as features. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our $\widetilde{O}(\varepsilon^{-2-d/(\nu+1)})$ sample complexity, where $d$ is the dimension of the state-action space and $\nu$ the order of smoothness, recovers the state-of-the-art result of discretization approaches for the special case of Lipschitz MDPs $(\nu=0)$. At the same time, for $\nu\to\infty$, it recovers and greatly generalizes the $O(\varepsilon^{-2})$ rate of low-rank MDPs, which are more amenable to regression approaches. In this sense, our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
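As a quick reading aid (not part of the paper's abstract), the two regimes mentioned above follow directly from plugging the limiting values of $\nu$ into the exponent $-2-d/(\nu+1)$:

$$
\nu = 0:\quad \widetilde{O}\!\left(\varepsilon^{-2-d}\right),
\qquad
\nu \to \infty:\quad \frac{d}{\nu+1} \to 0 \;\Rightarrow\; \widetilde{O}\!\left(\varepsilon^{-2}\right),
$$

i.e., the single bound moves continuously from the nonparametric rate of discretization-based methods for Lipschitz MDPs to the parametric rate associated with low-rank MDPs.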

Cite

Text

Maran et al. "Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs." Conference on Learning Theory, 2024.

Markdown

[Maran et al. "Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs." Conference on Learning Theory, 2024.](https://mlanthology.org/colt/2024/maran2024colt-projection/)

BibTeX

@inproceedings{maran2024colt-projection,
  title     = {{Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs}},
  author    = {Maran, Davide and Metelli, Alberto Maria and Papini, Matteo and Restelli, Marcello},
  booktitle = {Conference on Learning Theory},
  year      = {2024},
  pages     = {3743--3774},
  volume    = {247},
  url       = {https://mlanthology.org/colt/2024/maran2024colt-projection/}
}