Zap Q-Learning
Abstract
The Zap Q-learning algorithm introduced in this paper improves on Watkins' original algorithm and on recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that its transient behavior closely matches a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix-gain sequence. The analysis suggests that the approach leads to stable and efficient computation even in non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases.
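To make the two time-scale idea concrete, the following is a minimal, hypothetical sketch of a Zap-style update for a tabular MDP. All names (`zap_q_learning`, the step-size choices `alpha_n = 1/n` and `gamma_n = 1/n**rho`, the uniform exploration policy) are illustrative assumptions, not the authors' code: the matrix-gain estimate is tracked on the faster time scale `gamma_n`, while the parameter takes Newton-Raphson-like steps on the slower time scale `alpha_n`.

```python
import numpy as np

def zap_q_learning(P, R, discount=0.95, n_steps=5000, rho=0.85, seed=0):
    """Hypothetical sketch of a Zap-style tabular Q-learning update.

    P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    Two step-size sequences: alpha_n = 1/n for the parameter and
    gamma_n = 1/n**rho with rho in (0.5, 1) for the matrix gain, so the
    gain estimate evolves on a faster time scale than the parameter.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    d = S * A                      # one indicator basis vector per (state, action)
    theta = np.zeros(d)            # flattened Q-function estimate
    A_hat = -np.eye(d)             # matrix-gain estimate, initialized invertible
    x = 0
    for n in range(1, n_steps + 1):
        alpha, gamma = 1.0 / n, 1.0 / n ** rho
        u = rng.integers(A)                      # uniform exploratory policy
        x_next = rng.choice(S, p=P[x, u])
        Q = theta.reshape(S, A)
        psi = np.zeros(d)                        # indicator of (x, u)
        psi[x * A + u] = 1.0
        psi_next = np.zeros(d)                   # indicator of (x', greedy action)
        psi_next[x_next * A + Q[x_next].argmax()] = 1.0
        td = R[x, u] + discount * Q[x_next].max() - Q[x, u]   # TD error
        # Fast time scale: running estimate of the mean linearization matrix.
        A_n = np.outer(psi, discount * psi_next - psi)
        A_hat += gamma * (A_n - A_hat)
        # Slow time scale: Newton-Raphson-like step with the matrix gain.
        # (pinv used for robustness in this sketch; A_hat can be
        # near-singular early on.)
        theta -= alpha * np.linalg.pinv(A_hat) @ (psi * td)
        x = x_next
    return theta.reshape(S, A)
```

In contrast to a scalar-gain update, the `A_hat` inverse rescales the TD-error direction, which is what drives the optimal-asymptotic-variance behavior described in the abstract.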
Cite
Text
Devraj and Meyn. "Zap Q-Learning." Neural Information Processing Systems, 2017.
Markdown
[Devraj and Meyn. "Zap Q-Learning." Neural Information Processing Systems, 2017.](https://mlanthology.org/neurips/2017/devraj2017neurips-zap/)
BibTeX
@inproceedings{devraj2017neurips-zap,
title = {{Zap Q-Learning}},
author = {Devraj, Adithya M and Meyn, Sean},
booktitle = {Neural Information Processing Systems},
year = {2017},
pages = {2235--2244},
url = {https://mlanthology.org/neurips/2017/devraj2017neurips-zap/}
}