How to Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Abstract

Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm at most $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving upon the best known rate $O(\varepsilon^{-8/3})$. If $f(x)$ is nonconvex, to find its $\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\tilde{O}(\varepsilon^{-3.5})$, where previously SGD variants only achieved $\tilde{O}(\varepsilon^{-4})$. This is no slower than the best known stochastic version of Newton's method in all parameter regimes.
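The paper's SGD3 and SGD5 algorithms are not reproduced here. As a minimal sketch of the distinction the abstract draws, the snippet below runs plain SGD on a toy convex stochastic quadratic and reports both the objective suboptimality $f(x) - f(x^*)$ and the gradient norm $\|\nabla f(x)\|$; the objective, step-size schedule, and all variable names are illustrative assumptions, not taken from the paper.

import numpy as np

# Toy convex stochastic objective: f(x) = E[ 0.5 * ||x - z||^2 ] with z ~ N(x_star, I),
# so grad f(x) = x - x_star and f(x) - f(x_star) = 0.5 * ||x - x_star||^2.
# This is plain SGD, NOT the paper's SGD3/SGD5; it only illustrates the two
# criteria (small objective error vs. small gradient norm) that the abstract contrasts.
rng = np.random.default_rng(0)
d = 10
x_star = rng.normal(size=d)          # minimizer of the population objective
x = np.zeros(d)                      # initial iterate
T = 10_000

for t in range(1, T + 1):
    z = x_star + rng.normal(size=d)  # stochastic sample with mean x_star
    stoch_grad = x - z               # unbiased estimate of grad f(x)
    eta = 1.0 / np.sqrt(t)           # standard O(1/sqrt(t)) step size (an assumption)
    x -= eta * stoch_grad

subopt = 0.5 * np.linalg.norm(x - x_star) ** 2   # f(x) - f(x_star)
grad_norm = np.linalg.norm(x - x_star)           # ||grad f(x)|| for the population objective
print(f"suboptimality ~ {subopt:.2e}, gradient norm ~ {grad_norm:.2e}")

For a fixed accuracy target, the abstract's point is that the iteration count needed to drive the gradient norm below $\varepsilon$ with plain SGD is worse than the near-optimal $\tilde{O}(\varepsilon^{-2})$ rate that SGD3 attains in the convex case.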

Cite

Text

Allen-Zhu. "How to Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD." Neural Information Processing Systems, 2018.

Markdown

[Allen-Zhu. "How to Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/allenzhu2018neurips-make/)

BibTeX

@inproceedings{allenzhu2018neurips-make,
  title     = {{How to Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD}},
  author    = {Allen-Zhu, Zeyuan},
  booktitle = {Neural Information Processing Systems},
  year      = {2018},
  pages     = {1157--1167},
  url       = {https://mlanthology.org/neurips/2018/allenzhu2018neurips-make/}
}