Adam Exploits $\ell_\infty$-Geometry of Loss Landscape via Coordinate-Wise Adaptivity
Abstract
Adam outperforms SGD when training language models. Yet this advantage is not well understood theoretically -- previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$ and are already minimax-optimal in the non-convex setting, with both achieving the $\widetilde{O}(T^{-1/4})$ rate. In this work, we argue that the exploitation of a favorable $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields much better empirical smoothness constants for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
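For context, "coordinate-wise adaptivity" refers to Adam normalizing each parameter coordinate by its own second-moment estimate. The sketch below is a generic NumPy rendering of the standard Adam update (Kingma & Ba, 2015), not code from this paper, with the usual default hyperparameters; it contrasts Adam with plain SGD, whose update is just the raw gradient and is therefore unaffected by a rotation of the coordinate system.

import numpy as np

def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam update; all operations are elementwise, so every
    # coordinate gets its own effective step size (coordinate-wise adaptivity).
    m = beta1 * m + (1 - beta1) * g              # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2         # second-moment EMA, per coordinate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate normalization
    return x, m, v

def sgd_step(x, g, lr=0.1):
    # Plain SGD: the update is the raw gradient, so rotating the coordinates
    # rotates the update identically (rotation-equivariant).
    return x - lr * g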
Cite

Xie et al. "Adam Exploits $\ell_\infty$-Geometry of Loss Landscape via Coordinate-Wise Adaptivity." International Conference on Learning Representations, 2025. https://mlanthology.org/iclr/2025/xie2025iclr-adam/

BibTeX
@inproceedings{xie2025iclr-adam,
title = {{Adam Exploits $\ell_\infty$-Geometry of Loss Landscape via Coordinate-Wise Adaptivity}},
author = {Xie, Shuo and Mohamadi, Mohamad Amin and Li, Zhiyuan},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/xie2025iclr-adam/}
}