Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

Abstract

Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However, existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon $H$. In this paper, we propose an online RL algorithm, namely MatrixRL, that leverages ideas from linear bandits to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound $O\big(H^2 d \log T \sqrt{T}\big)$, where $d$ is the number of features, independent of the number of state-action pairs. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary reproducing kernel Hilbert space without using explicit features. In this case, kernelized MatrixRL satisfies a regret bound $O\big(H^2 \widetilde{d} \log T \sqrt{T}\big)$, where $\widetilde{d}$ is the effective dimension of the kernel space.
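
To make the abstract's idea concrete, here is a minimal sketch (not the authors' implementation) of a matrix-bandit-style model estimate: the transition model is assumed to factor through feature maps as $P(s' \mid s, a) \approx \phi(s,a)^\top M\, \psi(s')$, the core matrix $M$ is fit by ridge regression on observed transitions, and an elliptical confidence width supplies an optimism bonus for exploration. The class name `CoreMatrixEstimator`, the feature dimensions, and the bonus scaling `beta` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


class CoreMatrixEstimator:
    """Hypothetical sketch of a ridge-regression estimate of a transition core matrix."""

    def __init__(self, d_phi, d_psi, reg=1.0):
        self.A = reg * np.eye(d_phi)        # regularized Gram matrix: sum of phi phi^T + reg * I
        self.B = np.zeros((d_phi, d_psi))   # cross-moment: sum of phi psi(s')^T

    def update(self, phi_sa, psi_next):
        """Record one observed transition (s, a) -> s'."""
        self.A += np.outer(phi_sa, phi_sa)
        self.B += np.outer(phi_sa, psi_next)

    def core_matrix(self):
        """Ridge-regression estimate of M, obtained by solving A M = B."""
        return np.linalg.solve(self.A, self.B)

    def bonus(self, phi_sa, beta=1.0):
        """Optimism bonus: beta * sqrt(phi^T A^{-1} phi), the elliptical confidence width."""
        return beta * np.sqrt(phi_sa @ np.linalg.solve(self.A, phi_sa))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    est = CoreMatrixEstimator(d_phi=4, d_psi=3)
    for _ in range(100):
        est.update(rng.normal(size=4), rng.normal(size=3))
    print(est.core_matrix().shape)           # (4, 3)
    print(est.bonus(rng.normal(size=4)))     # scalar exploration bonus
```

In a full algorithm, an optimistic value iteration would add such a bonus to the estimated Bellman backup at each step of the horizon; the sketch above only shows the model-estimation and bonus components suggested by the abstract.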

Cite

Text

Yang and Wang. "Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound." International Conference on Machine Learning, 2020.

Markdown

[Yang and Wang. "Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/yang2020icml-reinforcement/)

BibTeX

@inproceedings{yang2020icml-reinforcement,
  title     = {{Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound}},
  author    = {Yang, Lin and Wang, Mengdi},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {10746--10756},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/yang2020icml-reinforcement/}
}