On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm
Abstract
As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not well understood theoretically. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E[||\nabla f(x^k)||_1]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E[||\nabla f(x)||_1]\geq\sqrt{\frac{2d}{\pi}}E[||\nabla f(x)||_2]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $||\nabla f(x)||_1=\varTheta(\sqrt{d})||\nabla f(x)||_2$. Both support the view that our convergence rate can be considered analogous to the optimal convergence rate of SGD.
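A minimal derivation sketch (not part of the original abstract) of the stated Gaussian bound, assuming the entries $g_i$ of $g=\nabla f(x)$ are i.i.d. $\mathcal N(0,1)$:
$$
E[||g||_1]=\sum_{i=1}^d E|g_i|=d\sqrt{\tfrac{2}{\pi}},\qquad
E[||g||_2]\leq\sqrt{E[||g||_2^2]}=\sqrt{d}\ \ \text{(Jensen's inequality)},
$$
so $E[||g||_1]=\sqrt{\tfrac{2d}{\pi}}\cdot\sqrt{d}\geq\sqrt{\tfrac{2d}{\pi}}\,E[||g||_2]$, which accounts for the $\sqrt{d}$ factor separating the $\ell_1$-norm rate above from the $\ell_2$-norm rate of SGD.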
Cite
Text
Li et al. "On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm." Advances in Neural Information Processing Systems, 2025.
Markdown
[Li et al. "On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/li2025neurips-dk/)
BibTeX
@inproceedings{li2025neurips-dk,
  title     = {{On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm}},
  author    = {Li, Huan and Dong, Yiming and Lin, Zhouchen},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/li2025neurips-dk/}
}