Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments

Abstract

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum and Adam/AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large-batch setting, and (3) has half the memory footprint of Adam.
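The update described in the abstract — a second moment accumulated per layer from the gradient's norm (rather than per parameter, which is what halves the memory cost relative to Adam), the gradient normalized by that moment, and weight decay added in decoupled form — can be sketched as follows. This is a minimal NumPy illustration following the paper's algorithm; the function name, state layout, and hyperparameter defaults are assumptions for the example.

```python
import numpy as np

def novograd_step(params, grads, states, lr=0.01, beta1=0.95, beta2=0.5,
                  eps=1e-8, weight_decay=0.001):
    """One NovoGrad-style update. Each entry of params/grads is one layer.

    states holds per-layer (m, v): m is a momentum buffer with the layer's
    shape; v is a SCALAR second moment of the layer-wise gradient norm,
    so the optimizer stores one extra tensor per layer instead of two.
    """
    new_params, new_states = [], []
    for w, g, (m, v) in zip(params, grads, states):
        g_norm_sq = float(np.sum(g * g))      # layer-wise squared gradient norm
        if v is None:                          # first step: initialize v directly
            v = g_norm_sq
        else:
            v = beta2 * v + (1.0 - beta2) * g_norm_sq
        # Normalize the gradient by the layer's second moment, then add
        # decoupled weight decay before accumulating momentum.
        g_hat = g / (np.sqrt(v) + eps) + weight_decay * w
        m = beta1 * m + g_hat
        new_params.append(w - lr * m)
        new_states.append((m, v))
    return new_params, new_states
```

Because normalization happens inside the momentum accumulation, each layer's step size is decoupled from its gradient magnitude, which is what the abstract credits for robustness to the learning rate and to weight initialization.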

Cite

Text

Ginsburg et al. "Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments." International Conference on Learning Representations, 2020.

Markdown

[Ginsburg et al. "Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/ginsburg2020iclr-training/)

BibTeX

@inproceedings{ginsburg2020iclr-training,
  title     = {{Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments}},
  author    = {Ginsburg, Boris and Castonguay, Patrice and Hrinchuk, Oleksii and Kuchaiev, Oleksii and Lavrukhin, Vitaly and Leary, Ryan and Li, Jason and Nguyen, Huyen and Zhang, Yang and Cohen, Jonathan M.},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/ginsburg2020iclr-training/}
}