Kernelized Reinforcement Learning with Order Optimal Regret Bounds
Abstract
Modern reinforcement learning (RL) has shown empirical success in various real-world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-action pairs or on simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration, for the setting where the action-value function lies in a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show a significant improvement over the state of the art that is polynomial in the number of episodes. In particular, with highly non-smooth kernels (such as the Neural Tangent kernel or some Matérn kernels), the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order optimal in the cases where a lower bound on regret is known (which includes the kernels mentioned above).
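To make the kernelized-optimism idea concrete, below is a minimal sketch of an optimistic action-value estimate built from kernel ridge regression: the regression mean plus an uncertainty-based exploration bonus, in the spirit of optimistic least-squares value iteration. It is an illustration under assumed choices (an RBF kernel, a fixed bonus multiplier `beta`, and regularizer `lam`), not the paper's exact $\pi$-KRVI procedure or its confidence-width construction.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel between rows of X and Y (illustrative choice)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def optimistic_value(z_query, Z, y, lam=1.0, beta=2.0):
    """Kernel ridge regression mean plus a UCB-style exploration bonus.

    Z : (n, d) observed state-action features, y : (n,) regression targets,
    z_query : (m, d) state-action pairs to evaluate optimistically.
    """
    K = rbf_kernel(Z, Z)                        # (n, n) Gram matrix
    Kq = rbf_kernel(z_query, Z)                 # (m, n) cross-kernel
    A = np.linalg.solve(K + lam * np.eye(len(Z)), np.eye(len(Z)))
    mean = Kq @ A @ y                           # ridge-regression estimate
    # Posterior-variance style uncertainty used as the optimism bonus.
    kqq = np.ones(len(z_query))                 # k(z, z) = 1 for the RBF kernel
    var = kqq - np.einsum("ij,jk,ik->i", Kq, A, Kq)
    return mean + beta * np.sqrt(np.maximum(var, 0.0))

# Usage: at a given state, act greedily with respect to the optimistic values.
rng = np.random.default_rng(0)
Z, y = rng.normal(size=(50, 3)), rng.normal(size=50)
candidates = rng.normal(size=(4, 3))            # e.g. one row per candidate action
best_action = int(np.argmax(optimistic_value(candidates, Z, y)))
```

In value-iteration-style algorithms, an estimate of this form is typically applied backward over the steps of an episode, with the regression targets built from rewards plus the (truncated) optimistic values of the next step.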
Cite
Text
Vakili and Olkhovskaya. "Kernelized Reinforcement Learning with Order Optimal Regret Bounds." Neural Information Processing Systems, 2023.
Markdown
[Vakili and Olkhovskaya. "Kernelized Reinforcement Learning with Order Optimal Regret Bounds." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/vakili2023neurips-kernelized/)
BibTeX
@inproceedings{vakili2023neurips-kernelized,
title = {{Kernelized Reinforcement Learning with Order Optimal Regret Bounds}},
author = {Vakili, Sattar and Olkhovskaya, Julia},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/vakili2023neurips-kernelized/}
}