Accumulated Gradient Normalization
Abstract
This work addresses the instability of asynchronous data-parallel optimization by introducing a novel distributed optimizer that efficiently optimizes a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. As a result, the magnitude of a worker delta is smaller than that of a raw accumulated gradient, and it provides a better direction towards a minimum than individual first-order gradients. This in turn forces possible implicit momentum fluctuations to be more aligned, under the assumption that all workers contribute towards a single minimum. Since staleness in asynchrony induces (implicit) momentum, our approach mitigates the parameter-staleness problem more effectively and, as we show empirically, achieves a better convergence rate than other optimizers such as asynchronous EASGD and DynSGD.
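The worker-side idea described in the abstract (accumulate a sequence of first-order gradients locally, then normalize the accumulated gradient before pushing it to the parameter server) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `agn_worker_delta`, `grad_fn`, and the hyperparameters `lam` (number of local steps) and `lr` are hypothetical.

```python
import numpy as np

def agn_worker_delta(theta, grad_fn, batches, lam=8, lr=0.1):
    """Sketch of one AGN worker commit.

    theta   -- worker's copy of the central parameters (numpy array)
    grad_fn -- hypothetical callable (params, batch) -> gradient
    batches -- iterator yielding lam mini-batches
    lam     -- number of local first-order gradients to accumulate
    lr      -- local learning rate
    """
    local = theta.copy()
    accumulated = np.zeros_like(theta)
    for _ in range(lam):
        g = grad_fn(local, next(batches))
        accumulated += g
        local -= lr * g  # local exploration step between commits
    # Normalizing by lam shrinks the delta's magnitude relative to the
    # raw accumulated gradient while keeping its averaged direction,
    # which is the property the abstract attributes to the method.
    return -lr * accumulated / lam
```

The delta returned here would then be sent asynchronously to the parameter server, whose stored model is updated by simple addition.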
Cite
Text
Hermans et al. "Accumulated Gradient Normalization." Proceedings of the Ninth Asian Conference on Machine Learning, 2017.

Markdown
[Hermans et al. "Accumulated Gradient Normalization." Proceedings of the Ninth Asian Conference on Machine Learning, 2017.](https://mlanthology.org/acml/2017/hermans2017acml-accumulated/)

BibTeX
@inproceedings{hermans2017acml-accumulated,
title = {{Accumulated Gradient Normalization}},
author = {Hermans, Joeri R. and Spanakis, Gerasimos and Möckel, Rico},
booktitle = {Proceedings of the Ninth Asian Conference on Machine Learning},
year = {2017},
  pages = {439--454},
volume = {77},
url = {https://mlanthology.org/acml/2017/hermans2017acml-accumulated/}
}