Gradient Descent: Second Order Momentum and Saturating Error
Abstract
Batch gradient descent, $\Delta w(t) = -\eta\, dE/dw(t)$, converges to a minimum of quadratic form with a time constant no better than $\frac{1}{4}\lambda_{\max}/\lambda_{\min}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of the Hessian matrix of $E$ with respect to $w$. It was recently shown that adding a momentum term, $\Delta w(t) = -\eta\, dE/dw(t) + \alpha\, \Delta w(t-1)$, improves this to $\frac{1}{4}\sqrt{\lambda_{\max}/\lambda_{\min}}$, although only in the batch case. Here we show that second-order momentum, $\Delta w(t) = -\eta\, dE/dw(t) + \alpha\, \Delta w(t-1) + \beta\, \Delta w(t-2)$, can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a nonquadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
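The three update rules in the abstract can be compared numerically with a minimal sketch. It is not code from the paper. The function name `descend`, the diagonal quadratic error $E(w) = \frac{1}{2}\sum_i \lambda_i w_i^2$, the starting point, and the eigenvalues 1 and 100 are all assumptions chosen for illustration. The step size and momentum settings are the standard textbook choices for a quadratic, not values taken from the paper.

```python
import math


def descend(eigenvalues, eta, alpha=0.0, beta=0.0, steps=200):
    """Gradient descent with optional first- and second-order momentum.

    Minimizes the illustrative error E(w) = 0.5 * sum(lam_i * w_i**2),
    starting from w = (1, ..., 1), and returns the error before each step.
    """
    w = [1.0] * len(eigenvalues)
    dw_prev1 = [0.0] * len(w)  # Delta w(t - 1)
    dw_prev2 = [0.0] * len(w)  # Delta w(t - 2)
    errors = []
    for _ in range(steps):
        errors.append(0.5 * sum(lam * x * x for lam, x in zip(eigenvalues, w)))
        grad = [lam * x for lam, x in zip(eigenvalues, w)]  # dE/dw
        # Delta w(t) = -eta dE/dw(t) + alpha Delta w(t-1) + beta Delta w(t-2)
        dw = [-eta * g + alpha * d1 + beta * d2
              for g, d1, d2 in zip(grad, dw_prev1, dw_prev2)]
        w = [x + d for x, d in zip(w, dw)]
        dw_prev2, dw_prev1 = dw_prev1, dw
    return errors


# Assumed Hessian eigenvalues, giving lambda_max / lambda_min = 100.
lam_min, lam_max = 1.0, 100.0
eigs = [lam_min, lam_max]

# Plain gradient descent with the usual step size 2 / (lam_min + lam_max).
plain = descend(eigs, eta=2.0 / (lam_min + lam_max))

# First-order momentum with the usual settings for a quadratic.
s_min, s_max = math.sqrt(lam_min), math.sqrt(lam_max)
momentum = descend(
    eigs,
    eta=4.0 / (s_max + s_min) ** 2,
    alpha=((s_max - s_min) / (s_max + s_min)) ** 2,
)

print(f"final error, plain:    {plain[-1]:.3e}")
print(f"final error, momentum: {momentum[-1]:.3e}")
```

On this toy problem the momentum run should reach a far smaller error in the same number of steps, in line with the improvement from $\lambda_{\max}/\lambda_{\min}$ to $\sqrt{\lambda_{\max}/\lambda_{\min}}$ described above. Passing a nonzero `beta` turns on the second-order term. Stability then depends on all three parameters together, so the settings above should not be expected to stay stable once `beta` changes.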
Cite
Text
Pearlmutter. "Gradient Descent: Second Order Momentum and Saturating Error." Neural Information Processing Systems, 1991.
Markdown
[Pearlmutter. "Gradient Descent: Second Order Momentum and Saturating Error." Neural Information Processing Systems, 1991.](https://mlanthology.org/neurips/1991/pearlmutter1991neurips-gradient/)
BibTeX
@inproceedings{pearlmutter1991neurips-gradient,
title = {{Gradient Descent: Second Order Momentum and Saturating Error}},
author = {Pearlmutter, Barak},
booktitle = {Neural Information Processing Systems},
year = {1991},
pages = {887-894},
url = {https://mlanthology.org/neurips/1991/pearlmutter1991neurips-gradient/}
}