SGD vs GD: Rank Deficiency in Linear Networks
Abstract
In this article, we study the behaviour of continuous-time gradient methods on a two-layer linear network with square loss. A dichotomy between SGD and GD is revealed: GD preserves the rank at initialization while (label noise) SGD diminishes the rank regardless of the initialization. We demonstrate this rank deficiency by studying the time evolution of the determinant of a matrix of parameters. To further understand this phenomenon, we derive the stochastic differential equation (SDE) governing the eigenvalues of the parameter matrix. This SDE unveils a repulsive force between the eigenvalues: a key regularization mechanism that induces rank deficiency. Our results are well supported by experiments illustrating the phenomenon beyond linear networks and regression tasks.
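A minimal sketch (not the authors' code) of the dichotomy described above: full-batch GD versus gradient steps with fresh Gaussian label noise injected at every iteration, on a two-layer linear network f(x) = W2 W1 x with square loss. The dimensions, step size, noise level, and rank-1 teacher below are illustrative assumptions; the singular values of the first-layer matrix W1 are reported as a proxy for its rank.

```python
# Illustrative sketch: two-layer linear network trained with square loss,
# comparing noiseless GD with (label noise) SGD, as in the abstract.
# All hyperparameters here are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, h, k, n = 6, 6, 6, 200                  # input dim, width, output dim, samples
A_star = np.outer(rng.standard_normal(k),
                  rng.standard_normal(d))  # rank-1 teacher map
X = rng.standard_normal((n, d))
Y = X @ A_star.T

def train(label_noise, steps=50_000, lr=5e-3):
    """Train f(x) = W2 W1 x; return singular values of W1 after training."""
    W1 = 0.1 * rng.standard_normal((h, d))
    W2 = 0.1 * rng.standard_normal((k, h))
    for _ in range(steps):
        # fresh label noise each step models the SGD noise of the abstract
        Yt = Y + label_noise * rng.standard_normal(Y.shape)
        R = X @ (W2 @ W1).T - Yt           # residuals, shape (n, k)
        gW2 = R.T @ (X @ W1.T) / n         # gradient of 0.5 * mean squared loss
        gW1 = W2.T @ R.T @ X / n
        W2 -= lr * gW2
        W1 -= lr * gW1
    return np.linalg.svd(W1, compute_uv=False)

print("GD  (no noise)    s(W1):", np.round(train(label_noise=0.0), 3))
print("SGD (label noise) s(W1):", np.round(train(label_noise=1.0), 3))
```

Printing the singular values of W1 for both runs gives a simple way to inspect whether noiseless GD keeps the spectrum at its initialization rank while label-noise training shrinks the trailing singular values; the exact magnitudes depend on the assumed hyperparameters.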
Cite
Text
Varre et al. "SGD vs GD: Rank Deficiency in Linear Networks." Neural Information Processing Systems, 2024. doi:10.52202/079017-1921

Markdown

[Varre et al. "SGD vs GD: Rank Deficiency in Linear Networks." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/varre2024neurips-sgd/) doi:10.52202/079017-1921

BibTeX
@inproceedings{varre2024neurips-sgd,
title = {{SGD vs GD: Rank Deficiency in Linear Networks}},
author = {Varre, Aditya and Sagitova, Margarita and Flammarion, Nicolas},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-1921},
url = {https://mlanthology.org/neurips/2024/varre2024neurips-sgd/}
}