Strength of Minibatch Noise in SGD

Abstract

The noise in stochastic gradient descent (SGD), caused by minibatch sampling, is poorly understood despite its practical importance in deep learning. This work presents the first systematic study of SGD noise and fluctuations close to a local minimum. We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. For application, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization, (3) explain why the linear scaling law between learning rate and batch size fails at a large learning rate or a small batch size, and (4) provide an understanding of how the discrete-time nature of SGD affects the recently discovered power-law phenomenon of SGD.
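
As a rough illustration of the quantity the abstract refers to (not code from the paper), the sketch below empirically estimates the covariance of minibatch gradient noise for linear regression at the least-squares minimum, where the full-batch gradient vanishes and the minibatch gradient is essentially pure noise. All names and parameter values are hypothetical choices for the example.

```python
import numpy as np

# Hypothetical setup: synthetic linear-regression data near its minimum.
rng = np.random.default_rng(0)
n, d, batch_size = 1000, 5, 32

X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Least-squares minimum; here the full-batch gradient is (numerically) zero.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

def grad(w, idx):
    """Mean-squared-error gradient on the rows indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = grad(w_star, np.arange(n))

# Minibatch gradient minus full-batch gradient = SGD noise sample.
samples = np.stack([
    grad(w_star, rng.choice(n, size=batch_size, replace=False)) - full_grad
    for _ in range(5000)
])
noise_cov = samples.T @ samples / len(samples)
print("estimated minibatch-noise covariance:")
print(noise_cov)
```

Rerunning the estimate with different values of `batch_size` shows the noise covariance shrinking roughly in proportion to 1/batch size, which is the kind of scaling behavior whose breakdown at large learning rates or small batch sizes the paper analyzes.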

Cite

Text

Ziyin et al. "Strength of Minibatch Noise in SGD." International Conference on Learning Representations, 2022.

Markdown

[Ziyin et al. "Strength of Minibatch Noise in SGD." International Conference on Learning Representations, 2022.](https://mlanthology.org/iclr/2022/ziyin2022iclr-strength/)

BibTeX

@inproceedings{ziyin2022iclr-strength,
  title     = {{Strength of Minibatch Noise in SGD}},
  author    = {Ziyin, Liu and Liu, Kangqiao and Mori, Takashi and Ueda, Masahito},
  booktitle = {International Conference on Learning Representations},
  year      = {2022},
  url       = {https://mlanthology.org/iclr/2022/ziyin2022iclr-strength/}
}