Does SGD Really Happen in Tiny Subspaces?
Abstract
Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies show that gradients approximately align with a low-rank eigenspace of the training loss Hessian, referred to as the dominant subspace. This paper investigates whether neural networks can be trained within this subspace. Our primary finding is that projecting the SGD update onto the dominant subspace does not reduce the training loss, suggesting the alignment between the gradient and dominant subspace is spurious. Surprisingly, excluding the dominant subspace component proves as effective as the original update. Similar observations are made for the large learning rate regime (also known as Edge of Stability) and Sharpness-Aware Minimization. We discuss the main causes and implications of this spurious alignment, shedding light on neural network training dynamics.
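To make the central operation concrete, below is a minimal sketch (not the authors' implementation) of the projection the abstract describes: the gradient is split into its component inside the top-k Hessian eigenspace (the "dominant subspace") and the orthogonal remainder. The model, data, choice of k, and iteration counts are illustrative assumptions; the dominant subspace is approximated with orthogonal iteration on Hessian-vector products.

```python
# A minimal sketch of projecting a gradient onto the top-k Hessian
# eigenspace ("dominant subspace") and onto its orthogonal complement.
# Model, data, and k are illustrative placeholders, not the paper's setup.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()
params = [p for p in model.parameters() if p.requires_grad]

def flat(ts):
    # Flatten a tuple of per-parameter tensors into one vector.
    return torch.cat([t.reshape(-1) for t in ts])

def hvp(v):
    # Hessian-vector product via double backprop.
    loss = loss_fn(model(x), y)
    g = torch.autograd.grad(loss, params, create_graph=True)
    gv = torch.dot(flat(g), v)
    return flat(torch.autograd.grad(gv, params))

def top_eigvecs(k, iters=50):
    # Orthogonal (subspace) iteration to approximate the top-k
    # Hessian eigenvectors; crude but sufficient for a sketch.
    n = sum(p.numel() for p in params)
    Q = torch.linalg.qr(torch.randn(n, k)).Q
    for _ in range(iters):
        Z = torch.stack([hvp(Q[:, j]) for j in range(k)], dim=1)
        Q = torch.linalg.qr(Z).Q
    return Q  # columns span the approximate dominant subspace

loss = loss_fn(model(x), y)
grad = flat(torch.autograd.grad(loss, params))

Q = top_eigvecs(k=3)
g_dom = Q @ (Q.T @ grad)   # component inside the dominant subspace
g_bulk = grad - g_dom      # component in the orthogonal complement
print(f"|g_dom| = {g_dom.norm():.4f}, |g_bulk| = {g_bulk.norm():.4f}")
```

Updating parameters with g_dom alone versus g_bulk alone corresponds to the two projected SGD variants the paper compares; its finding is that the g_bulk update trains as well as full SGD while the g_dom update does not reduce the loss.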
Cite
Text
Song et al. "Does SGD Really Happen in Tiny Subspaces?" ICML 2024 Workshops: HiLD, 2024.
Markdown
[Song et al. "Does SGD Really Happen in Tiny Subspaces?" ICML 2024 Workshops: HiLD, 2024.](https://mlanthology.org/icmlw/2024/song2024icmlw-sgd/)
BibTeX
@inproceedings{song2024icmlw-sgd,
title = {{Does SGD Really Happen in Tiny Subspaces?}},
author = {Song, Minhak and Ahn, Kwangjun and Yun, Chulhee},
booktitle = {ICML 2024 Workshops: HiLD},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/song2024icmlw-sgd/}
}