Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training

Abstract

Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of "how to accelerate adaptive gradient algorithms in a general manner", and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, at each iteration we construct a dynamical loss that combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method, and respectively minimize the first- and second-order Taylor approximations of this dynamical loss to update the variables. This yields our Win acceleration, which performs a conservative step and an aggressive step and linearly combines the two updates for acceleration. Next, we extend Win to Win2, which uses multiple aggressive update steps for faster convergence. We then apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation may provide insights for other accelerated methods and their integration into adaptive algorithms. Moreover, we theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam and LAMB compared with their non-accelerated counterparts. Experimental results demonstrate the faster convergence and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks.
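The conservative/aggressive combination described in the abstract can be sketched in NumPy as a single AdamW-style step. This is only an illustrative sketch: the function name `win_adamw_step`, the step-size ratio `gamma`, the mixing weight `alpha`, and the exact placement of the decoupled weight decay are assumptions for exposition, not the paper's precise update rules.

```python
import numpy as np

def win_adamw_step(w, state, grad, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=0.01, gamma=2.0, alpha=0.8):
    """One illustrative Win-style AdamW step (hypothetical sketch).

    Computes a conservative update (step size lr) and an aggressive
    update (step size gamma * lr) from the same momentum direction,
    then returns their linear combination, as the abstract describes.
    """
    state["t"] += 1
    t = state["t"]
    # Standard Adam moment estimates with bias correction.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    d = m_hat / (np.sqrt(v_hat) + eps)
    # Conservative update: small step with decoupled (AdamW-style) weight decay.
    w_cons = (w - lr * d) / (1 + lr * wd)
    # Aggressive update: larger step gamma * lr, same direction.
    w_aggr = (w - gamma * lr * d) / (1 + gamma * lr * wd)
    # Linear combination of the two updates for acceleration.
    return (1 - alpha) * w_cons + alpha * w_aggr
```

On a toy quadratic loss, iterating this step drives the parameters toward the minimizer; the weight-decay divisions correspond to the proximal-point view of decoupled regularization mentioned in the abstract.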

Cite

Text

Zhou et al. "Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training." Journal of Machine Learning Research, 2024.

Markdown

[Zhou et al. "Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training." Journal of Machine Learning Research, 2024.](https://mlanthology.org/jmlr/2024/zhou2024jmlr-win/)

BibTeX

@article{zhou2024jmlr-win,
  title     = {{Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training}},
  author    = {Zhou, Pan and Xie, Xingyu and Lin, Zhouchen and Toh, Kim-Chuan and Yan, Shuicheng},
  journal   = {Journal of Machine Learning Research},
  year      = {2024},
  pages     = {1--74},
  volume    = {25},
  url       = {https://mlanthology.org/jmlr/2024/zhou2024jmlr-win/}
}