SGD vs GD: Rank Deficiency in Linear Networks

Abstract

In this article, we study the behaviour of continuous-time gradient methods on a two-layer linear network with square loss. A dichotomy between SGD and GD is revealed: GD preserves the rank at initialization while (label noise) SGD diminishes the rank regardless of the initialization. We demonstrate this rank deficiency by studying the time evolution of the determinant of a matrix of parameters. To further understand this phenomenon, we derive the stochastic differential equation (SDE) governing the eigenvalues of the parameter matrix. This SDE unveils a repulsive force between the eigenvalues: a key regularization mechanism that induces rank deficiency. Our results are well supported by experiments illustrating the phenomenon beyond linear networks and regression tasks.
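
Below is a minimal, self-contained NumPy sketch of the kind of experiment the abstract describes: a two-layer linear network with scalar output trained on a regression task, comparing plain (full-batch) gradient descent against label-noise updates, and reporting the singular values of the first-layer matrix. The architecture, dimensions, and hyperparameters are illustrative assumptions, not the paper's exact setup.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
d, k, n = 10, 20, 200                       # input dim, hidden width, samples
X = rng.standard_normal((n, d))
w_star = rng.standard_normal((d, 1))        # teacher vector
Y = X @ w_star

def grads(W1, w2, Yb):
    # Squared loss 0.5/n * ||X W1^T w2 - Yb||^2 for the network f(x) = w2^T W1 x.
    err = X @ W1.T @ w2 - Yb                # (n, 1) residuals
    gW1 = (w2 @ err.T @ X) / n              # gradient w.r.t. first layer
    gw2 = (W1 @ X.T @ err) / n              # gradient w.r.t. second layer
    return gW1, gw2

def train(label_noise=0.0, steps=50_000, lr=5e-3):
    W1 = 0.3 * rng.standard_normal((k, d))  # generic (full-rank) initialization
    w2 = 0.3 * rng.standard_normal((k, 1))
    for _ in range(steps):
        # label_noise > 0 mimics label-noise SGD: fresh noise on the targets each step.
        Yb = Y + label_noise * rng.standard_normal((n, 1))
        gW1, gw2 = grads(W1, w2, Yb)
        W1, w2 = W1 - lr * gW1, w2 - lr * gw2
    return np.linalg.svd(W1, compute_uv=False)  # spectrum of the first-layer matrix

print("GD              singular values:", np.round(train(label_noise=0.0), 3))
print("label-noise SGD singular values:", np.round(train(label_noise=1.0), 3))

In this sketch, GD is expected to keep all singular values of W1 bounded away from zero (the rank of the initialization is preserved), whereas the label-noise run should drive most of them toward zero, leaving an approximately rank-one first layer.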

Cite

Text

Varre et al. "SGD vs GD: Rank Deficiency in Linear Networks." Neural Information Processing Systems, 2024. doi:10.52202/079017-1921

Markdown

[Varre et al. "SGD vs GD: Rank Deficiency in Linear Networks." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/varre2024neurips-sgd/) doi:10.52202/079017-1921

BibTeX

@inproceedings{varre2024neurips-sgd,
  title     = {{SGD vs GD: Rank Deficiency in Linear Networks}},
  author    = {Varre, Aditya and Sagitova, Margarita and Flammarion, Nicolas},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-1921},
  url       = {https://mlanthology.org/neurips/2024/varre2024neurips-sgd/}
}