On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Abstract

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not yet well understood theoretically. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E[\|\nabla f(x^k)\|_1]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E[\|\nabla f(x)\|_1]\geq\sqrt{\frac{2d}{\pi}}E[\|\nabla f(x)\|_2]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks show that $\|\nabla f(x)\|_1=\varTheta(\sqrt{d})\|\nabla f(x)\|_2$. Both support the conclusion that our convergence rate is analogous to the optimal convergence rate of SGD.
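The Gaussian comparison stated in the abstract can be checked numerically. The NumPy sketch below is not from the paper; the dimensions and sample count are illustrative. It draws standard-normal gradient surrogates $g\sim\mathcal N(0,I_d)$ and compares the empirical $E[\|g\|_1]$ with $\sqrt{2d/\pi}\,E[\|g\|_2]$, which also illustrates the $\varTheta(\sqrt{d})$ scaling between the two norms.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (10, 100, 1_000, 10_000):
    # 2000 Monte-Carlo samples of a d-dimensional standard Gaussian "gradient".
    g = rng.standard_normal((2000, d))
    l1 = np.abs(g).sum(axis=1).mean()        # empirical E[||g||_1]
    l2 = np.linalg.norm(g, axis=1).mean()    # empirical E[||g||_2]
    bound = np.sqrt(2 * d / np.pi) * l2      # sqrt(2d/pi) * E[||g||_2]
    print(f"d={d:6d}  E||g||_1={l1:10.2f}  sqrt(2d/pi)*E||g||_2={bound:10.2f}  "
          f"ratio={l1 / bound:.4f}")
```

The printed ratio stays at or slightly above 1 for every $d$, consistent with the inequality $E[\|g\|_1]\geq\sqrt{\frac{2d}{\pi}}E[\|g\|_2]$, and both quantities grow like $\sqrt{d}\cdot\|g\|_2$.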

Cite

Text

Li et al. "On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm." Advances in Neural Information Processing Systems, 2025.

Markdown

[Li et al. "On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/li2025neurips-dk/)

BibTeX

@inproceedings{li2025neurips-dk,
  title     = {{On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm}},
  author    = {Li, Huan and Dong, Yiming and Lin, Zhouchen},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/li2025neurips-dk/}
}