A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation
Abstract
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
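To make the two ingredients named above more concrete, here is a minimal, hypothetical sketch (not the paper's MQL-UCB pseudocode, which handles general function approximation): it specializes to linear features and illustrates a variance-weighted ridge regression update together with a determinant-doubling test that triggers policy switches only rarely. All names, constants, and the batching scheme below are illustrative assumptions.

```python
# Illustrative sketch only -- not the paper's actual algorithm.
# Shows (a) variance-weighted ridge regression and (b) a doubling-style
# policy-switching test, both in the linear special case.
import numpy as np

LAMBDA_REG = 1.0  # ridge regularization (assumed hyperparameter)


def weighted_ridge(phis, targets, variances, lam=LAMBDA_REG):
    """Variance-weighted ridge regression: samples are down-weighted by their
    estimated variance, so low-noise transitions carry more weight."""
    d = phis.shape[1]
    w = 1.0 / np.maximum(variances, 1e-8)           # inverse-variance weights
    cov = (phis * w[:, None]).T @ phis + lam * np.eye(d)
    rhs = (phis * w[:, None]).T @ targets
    return np.linalg.solve(cov, rhs), cov


def should_switch(cov, cov_at_last_switch):
    """Switch the policy only when the weighted covariance has gained a constant
    factor of information since the last switch; this keeps the number of
    switches logarithmic in the number of episodes."""
    _, logdet = np.linalg.slogdet(cov)
    _, logdet0 = np.linalg.slogdet(cov_at_last_switch)
    return logdet >= logdet0 + np.log(2.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 4, 200
    phis = rng.normal(size=(n, d))
    theta_star = rng.normal(size=d)
    variances = rng.uniform(0.5, 2.0, size=n)
    targets = phis @ theta_star + rng.normal(size=n) * np.sqrt(variances)

    cov_last = LAMBDA_REG * np.eye(d)
    switches = 0
    for t in range(20, n + 1, 20):                   # process data in batches
        theta_hat, cov = weighted_ridge(phis[:t], targets[:t], variances[:t])
        if should_switch(cov, cov_last):             # rare, deterministic switches
            switches += 1
            cov_last = cov
    print("policy switches:", switches,
          "| estimation error:", np.linalg.norm(theta_hat - theta_star))
```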
Cite
Text
Zhao et al. "A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation." Neural Information Processing Systems, 2024. doi:10.52202/079017-3002

Markdown
[Zhao et al. "A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhao2024neurips-nearly/) doi:10.52202/079017-3002

BibTeX
@inproceedings{zhao2024neurips-nearly,
title = {{A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation}},
author = {Zhao, Heyang and He, Jiafan and Gu, Quanquan},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3002},
url = {https://mlanthology.org/neurips/2024/zhao2024neurips-nearly/}
}