Understanding Adam Requires Better Rotation Dependent Assumptions
Abstract
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behaviour across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning.
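To make the notion of "running Adam in a rotated parameter space" concrete, the following is a minimal, hypothetical sketch (not the authors' code): the toy quadratic loss, the random orthogonal matrix Q, and all hyperparameters are assumptions for illustration only. It optimizes the same objective with plain Adam in the original basis and in a randomly rotated basis w = Q u, where Adam's coordinate-wise preconditioning no longer aligns with the loss's axes, which can illustrate why its behaviour is basis-dependent.

```python
# Illustrative sketch only: Adam on a toy quadratic, in the original basis
# versus a randomly rotated basis. All quantities here are assumptions made
# for demonstration and do not reproduce the paper's transformer experiments.
import numpy as np

rng = np.random.default_rng(0)
d = 10

# Random orthogonal matrix Q via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Toy quadratic loss with an ill-conditioned, axis-aligned Hessian.
diag = np.logspace(0, 3, d)
def loss_and_grad(w):
    return 0.5 * w @ (diag * w), diag * w

def adam(grad_fn, w0, steps=500, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Plain Adam given a gradient oracle grad_fn(w)."""
    w, m, v = w0.copy(), np.zeros_like(w0), np.zeros_like(w0)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = rng.standard_normal(d)

# Adam in the original (axis-aligned) basis.
w_orig = adam(lambda w: loss_and_grad(w)[1], w0)

# Adam in the rotated basis: optimize u with w = Q @ u, so the gradient
# w.r.t. u is Q^T times the gradient w.r.t. w, and Adam's per-coordinate
# scaling now acts along rotated axes instead of the original ones.
u0 = Q.T @ w0
u_rot = adam(lambda u: Q.T @ loss_and_grad(Q @ u)[1], u0)
w_rot = Q @ u_rot

print("final loss, original basis:", loss_and_grad(w_orig)[0])
print("final loss, rotated basis: ", loss_and_grad(w_rot)[0])
```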
Cite
Text
Zhang et al. "Understanding Adam Requires Better Rotation Dependent Assumptions." NeurIPS 2024 Workshops: OPT, 2024.
Markdown
[Zhang et al. "Understanding Adam Requires Better Rotation Dependent Assumptions." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/zhang2024neuripsw-understanding/)
BibTeX
@inproceedings{zhang2024neuripsw-understanding,
  title = {{Understanding Adam Requires Better Rotation Dependent Assumptions}},
  author = {Zhang, Tianyue H. and Maes, Lucas and Jolicoeur-Martineau, Alexia and Mitliagkas, Ioannis and Scieur, Damien and Lacoste-Julien, Simon and Guille-Escuret, Charles},
  booktitle = {NeurIPS 2024 Workshops: OPT},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/zhang2024neuripsw-understanding/}
}