Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Abstract

Adaptive gradient algorithms combine the moving-average idea with heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. But Nesterov acceleration, which converges faster than heavy-ball acceleration in theory and also in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm (Adan) to speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method that avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order gradient moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate stationary point within $O(\epsilon^{-4})$ stochastic gradient complexity on non-convex stochastic problems, matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNeXt, ViT, Swin, MAE, Transformer-XL, and BERT. More surprisingly, Adan can use half the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, MAE, etc., and also shows great tolerance to a wide range of minibatch sizes, e.g., from 1k to 32k. Code is released at https://github.com/sail-sg/Adan.
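
To make the abstract's description concrete, below is a minimal NumPy sketch of one Adan update following the paper's Algorithm 1. The function name, buffer handling, and default coefficients here are illustrative assumptions, not the official API; the released PyTorch implementation in the repository above differs in implementation details.

import numpy as np

def adan_step(theta, g, g_prev, m, v, n, lr=1e-3,
              beta1=0.02, beta2=0.08, beta3=0.01,
              eps=1e-8, weight_decay=0.0):
    # Sketch of one Adan update (paper's Algorithm 1). On the first
    # step, pass g_prev = g so the gradient-difference term vanishes.
    diff = g - g_prev
    m = (1 - beta1) * m + beta1 * g            # first-order moment
    v = (1 - beta2) * v + beta2 * diff         # moment of gradient differences
    u = g + (1 - beta2) * diff                 # Nesterov momentum estimation (NME)
    n = (1 - beta3) * n + beta3 * u**2         # second-order moment of the NME term
    eta = lr / (np.sqrt(n) + eps)              # per-coordinate step size
    # Decoupled weight decay is folded into the update as in the paper.
    theta = (theta - eta * (m + (1 - beta2) * v)) / (1 + weight_decay * lr)
    return theta, m, v, n

Note that the NME term u reuses the current and previous gradients instead of evaluating a gradient at an extrapolated point, which is exactly how Adan avoids the extra forward/backward pass and memory that vanilla Nesterov acceleration would require.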

Cite

Text

Xie et al. "Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models." NeurIPS 2022 Workshops: HITY, 2022.

Markdown

[Xie et al. "Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models." NeurIPS 2022 Workshops: HITY, 2022.](https://mlanthology.org/neuripsw/2022/xie2022neuripsw-adan/)

BibTeX

@inproceedings{xie2022neuripsw-adan,
  title     = {{Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models}},
  author    = {Xie, Xingyu and Zhou, Pan and Li, Huan and Lin, Zhouchen and Yan, Shuicheng},
  booktitle = {NeurIPS 2022 Workshops: HITY},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/xie2022neuripsw-adan/}
}