Speedy Q-Learning

Abstract

We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor \gamma, only T=O\big(\log(n)/(\epsilon^2(1-\gamma)^4)\big) steps are required for the SQL algorithm to converge to an \epsilon-optimal action-value function with high probability. This bound has a better dependency on 1/\epsilon and 1/(1-\gamma), and is thus tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration, which are considered to be more efficient than incremental methods such as Q-learning.
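For intuition, below is a minimal synchronous sketch of the SQL update as it is commonly stated: Q_{k+1}(x,a) = Q_k(x,a) + \alpha_k\big(T_k Q_{k-1}(x,a) - Q_k(x,a)\big) + (1-\alpha_k)\big(T_k Q_k(x,a) - T_k Q_{k-1}(x,a)\big) with \alpha_k = 1/(k+1), where T_k is the empirical Bellman operator built from a single fresh sample. The abstract above does not spell out the update, so this form is an assumption to be checked against the paper; the generative-model helper sample(s, a) and all other names are hypothetical.

    import numpy as np

    def speedy_q_learning(sample, n_states, n_actions, gamma, T):
        """Sketch of synchronous speedy Q-learning for T iterations.

        sample(s, a) is assumed to return (reward, next_state) drawn from the MDP.
        """
        q_prev = np.zeros((n_states, n_actions))   # Q_{k-1}
        q_curr = np.zeros((n_states, n_actions))   # Q_k
        for k in range(T):
            alpha = 1.0 / (k + 1)                  # step size alpha_k = 1/(k+1)
            q_next = np.empty_like(q_curr)
            for s in range(n_states):
                for a in range(n_actions):
                    r, s2 = sample(s, a)           # one fresh draw; reused for both backups
                    t_prev = r + gamma * q_prev[s2].max()   # empirical Bellman backup of Q_{k-1}
                    t_curr = r + gamma * q_curr[s2].max()   # empirical Bellman backup of Q_k
                    # small step toward T_k Q_{k-1}, plus an aggressive
                    # (1 - alpha)-weighted step on the difference of backups
                    q_next[s, a] = (q_curr[s, a]
                                    + alpha * (t_prev - q_curr[s, a])
                                    + (1.0 - alpha) * (t_curr - t_prev))
            q_prev, q_curr = q_curr, q_next
        return q_curr

The point of contrast with standard Q-learning is the second, (1-\alpha_k)-weighted term, which takes an aggressive step on the change in the empirical Bellman backup between consecutive iterates instead of shrinking the whole update by \alpha_k.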

Cite

Text

Ghavamzadeh et al. "Speedy Q-Learning." Neural Information Processing Systems, 2011.

Markdown

[Ghavamzadeh et al. "Speedy Q-Learning." Neural Information Processing Systems, 2011.](https://mlanthology.org/neurips/2011/ghavamzadeh2011neurips-speedy/)

BibTeX

@inproceedings{ghavamzadeh2011neurips-speedy,
  title     = {{Speedy Q-Learning}},
  author    = {Ghavamzadeh, Mohammad and Kappen, Hilbert J. and Azar, Mohammad G. and Munos, Rémi},
  booktitle = {Neural Information Processing Systems},
  year      = {2011},
  pages     = {2411--2419},
  url       = {https://mlanthology.org/neurips/2011/ghavamzadeh2011neurips-speedy/}
}