Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

Abstract

Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to ease optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with the accompanying weight decay) encourages gradient descent (GD) to reduce the sharpness of the loss surface. Here "sharpness" is carefully defined, given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
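
As a brief illustration of the scale-invariance mentioned in the abstract (a sketch in our own notation, not taken verbatim from the paper): if theta denotes the weights feeding into normalization layers, rescaling them leaves the loss unchanged, which forces the gradient to be orthogonal to the weights and motivates measuring sharpness on the unit sphere. The sharpness expression below is an illustrative definition along the lines the paper suggests, not necessarily its exact formulation.

% Scale-invariance of the loss under rescaling of normalized weights:
L(c\,\theta) = L(\theta) \quad \text{for all } c > 0
\;\Longrightarrow\;
\langle \nabla L(\theta),\, \theta \rangle = 0 .

% An illustrative scale-invariant notion of sharpness, evaluated on the unit sphere:
S(\theta) \;:=\; \lambda_{\max}\!\bigl( \nabla^2 L\bigl( \theta / \lVert \theta \rVert_2 \bigr) \bigr).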

Cite

Text

Lyu et al. "Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction." Neural Information Processing Systems, 2022.

Markdown

[Lyu et al. "Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/lyu2022neurips-understanding/)

BibTeX

@inproceedings{lyu2022neurips-understanding,
  title     = {{Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction}},
  author    = {Lyu, Kaifeng and Li, Zhiyuan and Arora, Sanjeev},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/lyu2022neurips-understanding/}
}