Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization

Abstract

Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge is that, while Adam with $\ell_2$ regularization intuitively optimizes the $\ell_2$-regularized loss, it is not clear whether AdamW optimizes any specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show that in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sums diverge, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameters is bounded by the inverse of the weight decay factor. This result builds on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the $\ell_\infty$ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
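
The following minimal NumPy sketch (not from the paper) illustrates the connection the abstract describes, on a hypothetical toy quadratic: the SignGD update with decoupled weight decay can be rewritten as a Frank-Wolfe-style convex combination with a vertex of the $\ell_\infty$ ball of radius $1/\lambda$, and AdamW, as a smoothed version of SignGD, ends up with its iterates roughly confined to the same ball. The loss, step size, and weight decay values are illustrative assumptions.

import numpy as np

def grad(theta):
    # Hypothetical toy objective: L(theta) = 0.5 * ||theta - target||^2.
    target = np.array([3.0, -2.0, 0.5])
    return theta - target

lr, lam = 0.05, 1.0                  # learning rate eta, decoupled weight decay lambda
beta1, beta2, eps = 0.9, 0.999, 1e-8
theta_adamw = np.zeros(3)
theta_sign = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)

for t in range(1, 2001):
    # AdamW: bias-corrected Adam step plus decoupled weight decay.
    g = grad(theta_adamw)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta_adamw = theta_adamw - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta_adamw)

    # SignGD with decoupled weight decay: theta <- theta - lr * (sign(g) + lam * theta),
    # rewritten as a Frank-Wolfe-style convex combination with a vertex of the
    # ell_inf ball of radius 1/lam (valid while lr * lam <= 1).
    g = grad(theta_sign)
    theta_sign = (1 - lr * lam) * theta_sign + (lr * lam) * (-np.sign(g) / lam)

# Both iterates end up with ell_inf norm approximately bounded by 1/lam,
# consistent with the implicit ell_inf constraint described in the abstract.
print("AdamW  ||theta||_inf =", np.max(np.abs(theta_adamw)), " bound 1/lam =", 1 / lam)
print("SignGD ||theta||_inf =", np.max(np.abs(theta_sign)), " bound 1/lam =", 1 / lam)

In this toy run the unconstrained minimizer lies outside the $\ell_\infty$ ball of radius $1/\lambda$ in two coordinates, so both methods settle near the boundary of the ball in those coordinates, mirroring the KKT characterization stated above.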

Cite

Text

Xie and Li. "Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization." International Conference on Machine Learning, 2024.

Markdown

[Xie and Li. "Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/xie2024icml-implicit/)

BibTeX

@inproceedings{xie2024icml-implicit,
  title     = {{Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization}},
  author    = {Xie, Shuo and Li, Zhiyuan},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {54488--54510},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/xie2024icml-implicit/}
}