Kernel-Based Reinforcement Learning: A Finite-Time Analysis
Abstract
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, an approach that requires very weak assumptions on the MDP and applies to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.
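To make the estimators in the abstract concrete, below is a minimal sketch of the kind of non-parametric kernel smoothing referred to: past transitions are weighted by a smoothing kernel applied to state-action distances, giving a kernel-regression estimate of the reward, a weighted empirical transition distribution, and a generalized visit count. The function name kernel_estimates, the Gaussian kernel choice, the Euclidean metric, and the bandwidth value are illustrative assumptions, not the paper's actual construction or code.

import numpy as np

def gaussian_kernel(distances, bandwidth):
    # Smoothing kernel weights; a Gaussian kernel is one illustrative choice.
    return np.exp(-0.5 * (distances / bandwidth) ** 2)

def kernel_estimates(query_sa, past_sa, past_rewards, bandwidth=0.1):
    """Kernel-smoothed reward estimate, transition weights, and generalized
    visit count at a query state-action pair, built from past transitions.

    query_sa:     (d,) feature vector of the query state-action pair
    past_sa:      (n, d) previously visited state-action pairs
    past_rewards: (n,) rewards observed at those pairs
    """
    distances = np.linalg.norm(past_sa - query_sa, axis=1)  # metric on the state-action space
    weights = gaussian_kernel(distances, bandwidth)
    count = weights.sum()                                   # generalized visit count
    norm = max(count, 1e-12)
    reward_hat = weights @ past_rewards / norm              # kernel regression of rewards
    transition_weights = weights / norm                     # weights over observed next states
    return reward_hat, transition_weights, count

# Example: estimates at a query point from three past transitions.
past_sa = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 0.8]])
past_rewards = np.array([1.0, 0.5, 0.0])
r_hat, p_hat, n_hat = kernel_estimates(np.array([0.1, 0.1]), past_sa, past_rewards)

In an optimistic algorithm of this kind, the generalized count would typically drive an exploration bonus that shrinks as more nearby data is collected; the precise bonus and kernel used by Kernel-UCBVI are specified in the paper, not here.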
Cite
Text
Domingues et al. "Kernel-Based Reinforcement Learning: A Finite-Time Analysis." International Conference on Machine Learning, 2021.
Markdown
[Domingues et al. "Kernel-Based Reinforcement Learning: A Finite-Time Analysis." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/domingues2021icml-kernelbased/)
BibTeX
@inproceedings{domingues2021icml-kernelbased,
title = {{Kernel-Based Reinforcement Learning: A Finite-Time Analysis}},
author = {Domingues, Omar Darwiche and Menard, Pierre and Pirotta, Matteo and Kaufmann, Emilie and Valko, Michal},
booktitle = {International Conference on Machine Learning},
year = {2021},
pages = {2783--2792},
volume = {139},
url = {https://mlanthology.org/icml/2021/domingues2021icml-kernelbased/}
}