On the Convergence Properties of a K-Step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization

Abstract

We adopt and analyze a synchronous K-step averaging stochastic gradient descent algorithm, which we call K-AVG, for solving large-scale machine learning problems. We establish convergence results for K-AVG with nonconvex objectives. Our analysis of K-AVG applies to many existing variants of synchronous SGD. We explain why the K-step delay is necessary and why it leads to better performance than traditional parallel stochastic gradient descent, which is equivalent to K-AVG with $K=1$. We also show that K-AVG scales better with the number of learners than asynchronous stochastic gradient descent (ASGD). Another advantage of K-AVG over ASGD is that it allows larger stepsizes and thus facilitates faster convergence. On a cluster of $128$ GPUs, K-AVG is faster than ASGD implementations and achieves better accuracy and faster convergence when training on the CIFAR-10 dataset.
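
The abstract describes the K-AVG scheme only at a high level, so the following is a minimal, single-process sketch of K-step averaging SGD under illustrative assumptions: each of P learners takes K local SGD steps on its own stream of stochastic gradients, and the learners' parameters are then averaged. The toy quadratic objective, the noise model, and names such as `stoch_grad` and `k_avg` are hypothetical and not taken from the paper; with `k_steps=1` the loop reduces to standard synchronous parallel SGD, matching the abstract's remark about $K=1$.

```python
# Minimal simulation of K-step averaging SGD (K-AVG) as described in the
# abstract: each of `num_learners` learners takes `k_steps` local SGD steps
# on independent stochastic gradients, then all local parameters are averaged.
# The quadratic objective and Gaussian gradient noise are illustrative
# assumptions, not the authors' setup.
import numpy as np

def stoch_grad(x, rng, noise=0.1):
    """Noisy gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return x + noise * rng.standard_normal(x.shape)

def k_avg(x0, num_learners=8, k_steps=4, stepsize=0.1, rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(rounds):
        local_params = []
        for _p in range(num_learners):
            xp = x.copy()
            for _s in range(k_steps):            # K local SGD steps per learner
                xp -= stepsize * stoch_grad(xp, rng)
            local_params.append(xp)
        x = np.mean(local_params, axis=0)        # synchronous averaging step
    return x

if __name__ == "__main__":
    x_final = k_avg(np.ones(10))
    print("final ||x|| =", np.linalg.norm(x_final))
```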

Cite

Text

Zhou and Cong. "On the Convergence Properties of a K-Step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/447

Markdown

[Zhou and Cong. "On the Convergence Properties of a K-Step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/zhou2018ijcai-convergence/) doi:10.24963/IJCAI.2018/447

BibTeX

@inproceedings{zhou2018ijcai-convergence,
  title     = {{On the Convergence Properties of a K-Step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization}},
  author    = {Zhou, Fan and Cong, Guojing},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2018},
  pages     = {3219-3227},
  doi       = {10.24963/IJCAI.2018/447},
  url       = {https://mlanthology.org/ijcai/2018/zhou2018ijcai-convergence/}
}