Momentum-Based Variance Reduction in Non-Convex SGD

Abstract

Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|]\le O(1/\sqrt{T}+\sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the best-known rate but without requiring knowledge of $\sigma$.
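To make the high-level description above concrete, here is a minimal sketch of what a momentum-based variance-reduced SGD loop with an adaptive learning rate can look like. It is not the paper's pseudocode: the estimator recursion, the function names `grad_fn` and `sample_fn`, and the constants `k`, `w`, `c` are illustrative assumptions chosen to mirror the abstract (a single stochastic gradient sample per step, no mega-batches, and a step size that adapts without knowledge of $\sigma$).

```python
import numpy as np

def storm_sketch(grad_fn, sample_fn, x0, T, k=0.1, w=0.1, c=1.0):
    """Illustrative momentum-based variance-reduced SGD loop (a sketch,
    not the paper's exact algorithm).

    grad_fn(x, xi) returns a stochastic gradient of F at x for sample xi;
    sample_fn() draws one fresh sample per iteration. The same sample is
    evaluated at the current and previous iterate to form the correction
    term. The constants k, w, c are hypothetical, untuned values.
    """
    x = np.array(x0, dtype=float)
    x_prev, d = None, None
    G_sum = 0.0                          # running sum of squared gradient norms
    eta = k / w ** (1.0 / 3.0)           # initial adaptive step size
    for _ in range(T):
        xi = sample_fn()                 # one sample, reused at two points below
        g = grad_fn(x, xi)
        if d is None:
            d = g                        # first step: plain stochastic gradient
        else:
            a = min(1.0, c * eta ** 2)   # momentum weight tied to the step size
            # Recursive, variance-reduced gradient estimate: a momentum average
            # corrected by the gradient difference at consecutive iterates.
            d = g + (1.0 - a) * (d - grad_fn(x_prev, xi))
        G_sum += float(np.dot(g, g))
        eta = k / (w + G_sum) ** (1.0 / 3.0)  # adaptive rate; no sigma needed
        x_prev, x = x, x - eta * d
    return x
```

The point of the sketch is the coupling between the two ingredients named in the abstract: the step size shrinks automatically as observed gradient magnitudes accumulate, and the momentum weight is tied to that step size, so no batch size or variance parameter has to be supplied by the user.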

Cite

Text

Cutkosky and Orabona. "Momentum-Based Variance Reduction in Non-Convex SGD." Neural Information Processing Systems, 2019.

Markdown

[Cutkosky and Orabona. "Momentum-Based Variance Reduction in Non-Convex SGD." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/cutkosky2019neurips-momentumbased/)

BibTeX

@inproceedings{cutkosky2019neurips-momentumbased,
  title     = {{Momentum-Based Variance Reduction in Non-Convex SGD}},
  author    = {Cutkosky, Ashok and Orabona, Francesco},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {15236--15245},
  url       = {https://mlanthology.org/neurips/2019/cutkosky2019neurips-momentumbased/}
}