Zap Q-Learning

Abstract

The Zap Q-learning algorithm introduced in this paper is an improvement of Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix gain sequence. The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases.
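The two time-scale structure described above can be sketched in code: the matrix-gain estimate is averaged with a step size that decays more slowly than the parameter step size, and the parameter update multiplies the TD direction by the (pseudo-)inverse of that estimate, mimicking a Newton-Raphson step. The following is a minimal illustrative sketch on a hypothetical toy MDP with tabular (one-hot) features; the MDP, step-size exponents, and iteration counts are illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (illustrative, not from the paper).
rng = np.random.default_rng(0)
nS, nA, beta = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
R = rng.standard_normal((nS, nA))               # deterministic reward r(s, a)

# Reference solution via Q-value iteration, for comparison.
Qstar = np.zeros((nS, nA))
for _ in range(3000):
    Qstar = R + beta * P @ Qstar.max(axis=1)

# Zap Q-learning with tabular one-hot features psi(s, a).
d = nS * nA
def psi(s, a):
    v = np.zeros(d)
    v[s * nA + a] = 1.0
    return v

theta = np.zeros(d)
Ahat = -np.eye(d)    # matrix-gain estimate, initialized negative definite
s = 0
for n in range(1, 100001):
    alpha = 1.0 / n              # slow step size for the parameter theta
    gamma = 1.0 / n ** 0.85      # faster step size for Ahat (two time scales)
    a = int(rng.integers(nA))    # uniform exploration
    s2 = rng.choice(nS, p=P[s, a])
    phi = psi(s, a)
    a2 = int(np.argmax([theta @ psi(s2, b) for b in range(nA)]))
    phi2 = psi(s2, a2)
    dtd = R[s, a] + beta * (theta @ phi2) - theta @ phi   # TD error
    # Rank-one sample of the mean linearization matrix, averaged on the fast scale.
    Ahat += gamma * (np.outer(phi, beta * phi2 - phi) - Ahat)
    # Newton-Raphson-style step; pinv guards against a singular early estimate.
    theta -= alpha * np.linalg.pinv(Ahat) @ phi * dtd
    s = s2

Qzap = theta.reshape(nS, nA)
print(np.max(np.abs(Qzap - Qstar)))
```

In this tabular special case the fixed point coincides with the optimal Q-function, so the learned table can be checked directly against value iteration; the fast averaging of `Ahat` relative to `theta` is what realizes the two time-scale design.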

Cite

Text

Devraj and Meyn. "Zap Q-Learning." Neural Information Processing Systems, 2017.

Markdown

[Devraj and Meyn. "Zap Q-Learning." Neural Information Processing Systems, 2017.](https://mlanthology.org/neurips/2017/devraj2017neurips-zap/)

BibTeX

@inproceedings{devraj2017neurips-zap,
  title     = {{Zap Q-Learning}},
  author    = {Devraj, Adithya M and Meyn, Sean},
  booktitle = {Neural Information Processing Systems},
  year      = {2017},
  pages     = {2235--2244},
  url       = {https://mlanthology.org/neurips/2017/devraj2017neurips-zap/}
}