Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay

Abstract

We prove the Fast Equilibrium Conjecture proposed by Li et al. (2020), i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate $\eta$ and weight decay factor $\lambda$ mixes in function space in $\tilde{\mathcal{O}}(\frac{1}{\lambda\eta})$ steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact, and analytic manifold. The analysis uses the framework of Li et al. (2021) and shows that for every $T>0$, the iterates of SGD with learning rate $\eta$ and weight decay factor $\lambda$ on the scale-invariant loss converge in distribution within $\Theta\left(\eta^{-1}\lambda^{-1}(T+\ln(\lambda/\eta))\right)$ iterations as $\eta\lambda\to 0$ with $\eta \le O(\lambda)\le O(1)$. Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as $T\to\infty$.
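The setting the abstract describes can be made concrete with a small numerical sketch. The snippet below is not the authors' code; it is a minimal NumPy illustration, under assumed toy choices (the synthetic data, the particular scale-invariant loss, and the hyperparameters $\eta$, $\lambda$, and batch size), of SGD with weight decay on a scale-invariant objective, where the quantity of interest is the direction $w/\|w\|$ and the relevant timescale is on the order of $1/(\eta\lambda)$ steps.

```python
# Minimal sketch (not from the paper): SGD with weight decay on a toy
# scale-invariant loss. Data, loss, and hyperparameters are illustrative
# assumptions chosen only to mirror the setting in the abstract.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 256
X = rng.normal(size=(n, d))             # synthetic data (assumption)

def grad_scale_invariant(w, batch):
    """Gradient of a toy scale-invariant loss L(w) = mean_i (1 - <x_i, w/||w||>)^2.
    Scale invariance: L(c w) = L(w) for every c > 0."""
    u = w / np.linalg.norm(w)
    r = 1.0 - batch @ u                  # residuals on the unit sphere
    g_u = -2.0 * (batch * r[:, None]).mean(axis=0)
    # Chain rule through the normalization w -> w/||w||:
    # du/dw = (I - u u^T) / ||w||, so the gradient is orthogonal to w.
    return (g_u - (g_u @ u) * u) / np.linalg.norm(w)

eta, lam, batch_size = 0.1, 0.01, 32     # learning rate, weight decay (assumptions)
w = rng.normal(size=d)

# SGD with weight decay: w <- (1 - eta*lam) w - eta * grad_B(w).
# The abstract's claim concerns the direction w/||w|| (which determines the
# function computed) mixing in roughly O~(1/(eta*lam)) such steps.
num_steps = int(10 / (eta * lam))        # a few multiples of the 1/(eta*lam) timescale
for t in range(num_steps):
    idx = rng.choice(n, size=batch_size, replace=False)
    w = (1.0 - eta * lam) * w - eta * grad_scale_invariant(w, X[idx])

print("final direction:", w / np.linalg.norm(w))
```

Because the minibatch gradient is always orthogonal to $w$, weight decay is the only force shrinking $\|w\|$ while the gradient noise inflates it, so the norm settles near an equilibrium and only the direction continues to diffuse; this is the "function space" dynamics whose mixing time the paper bounds.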

Cite

Text

Li et al. "Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay." Neural Information Processing Systems, 2022.

Markdown

[Li et al. "Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/li2022neurips-fast/)

BibTeX

@inproceedings{li2022neurips-fast,
  title     = {{Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay}},
  author    = {Li, Zhiyuan and Wang, Tianhao and Yu, Dingli},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/li2022neurips-fast/}
}