VO$Q$L: Towards Optimal Regret in Model-Free RL with Nonlinear Function Approximation

Abstract

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning, and bound its regret assuming closure under Bellman backups and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\widetilde{O}(d\sqrt{TH}+d^6H^{5})$ regret over $T$ episodes for a horizon-$H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making it the first computationally tractable and statistically optimal approach for linear MDPs.
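
As a rough illustration of the core primitive (a sketch, not text from the paper; the notation $\mathcal{F}$, $\bar{\sigma}_k$, and $\widehat{V}_{h+1}$ is assumed here for exposition), the variance-weighted regression step at stage $h$ fits the value backup with per-sample weights given by estimated variance upper bounds:

$$\widehat{f}_h \in \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \sum_{k=1}^{K} \frac{\big(f(s_h^k, a_h^k) - r_h^k - \widehat{V}_{h+1}(s_{h+1}^k)\big)^2}{\bar{\sigma}_k^{2}}.$$

Optimistic and pessimistic value estimates are then obtained by adding or subtracting a confidence bonus around $\widehat{f}_h$, which is where the regression-based upper and lower bounds on the optimal value function mentioned above come from.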

Cite

Text

Agarwal et al. "VO$Q$L: Towards Optimal Regret in Model-Free RL with Nonlinear Function Approximation." Conference on Learning Theory, 2023.

Markdown

[Agarwal et al. "VO$Q$L: Towards Optimal Regret in Model-Free RL with Nonlinear Function Approximation." Conference on Learning Theory, 2023.](https://mlanthology.org/colt/2023/agarwal2023colt-voql/)

BibTeX

@inproceedings{agarwal2023colt-voql,
  title     = {{VO$Q$L: Towards Optimal Regret in Model-Free RL with Nonlinear Function Approximation}},
  author    = {Agarwal, Alekh and Jin, Yujia and Zhang, Tong},
  booktitle = {Conference on Learning Theory},
  year      = {2023},
  pages     = {987--1063},
  volume    = {195},
  url       = {https://mlanthology.org/colt/2023/agarwal2023colt-voql/}
}