Bayesian Distributed Stochastic Gradient Descent

Abstract

We introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-throughput algorithm for training deep neural networks on parallel clusters. The algorithm uses amortized inference in a deep generative model to perform joint posterior predictive inference of mini-batch gradient computation times in a compute-cluster-specific manner. Specifically, it mitigates the straggler effect in synchronous, gradient-based optimization by choosing an optimal cutoff beyond which mini-batch gradient messages from slow workers are ignored. In our experiments, we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but actually increases the overall rate of convergence as a function of wall-clock time by eliminating idleness. The principal novel contribution and finding of this work go beyond this: we demonstrate that using run-times predicted by a generative model of cluster worker performance substantially improves on the static-cutoff prior art, reducing deep neural network training times on large compute clusters.
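To make the cutoff mechanism concrete, the Python sketch below simulates one synchronous step in which only gradients arriving before a cutoff are averaged. This is an illustrative assumption-laden toy, not the paper's method: the worker-time distribution, the fixed 90th-percentile cutoff, the learning rate, and all function names are hypothetical stand-ins for the posterior-predictive cutoff that BDSGD infers with its deep generative model.

```python
# Hypothetical sketch of a straggler-cutoff synchronous SGD step.
import numpy as np

rng = np.random.default_rng(0)

def simulate_worker_times(n_workers):
    # Assumed heavy-tailed compute times: most workers are fast, a few straggle.
    return rng.lognormal(mean=0.0, sigma=0.5, size=n_workers)

def cutoff_sgd_step(params, grads, times, cutoff, lr=0.1):
    """Average only gradients from workers that finish before `cutoff`.

    grads: (n_workers, dim) per-worker mini-batch gradients
    times: (n_workers,) per-worker compute times in seconds
    """
    arrived = times <= cutoff
    if not arrived.any():
        arrived[:] = True          # fall back to full synchronization
    g = grads[arrived].mean(axis=0)
    step_time = times[arrived].max()   # wall-clock cost of this step
    return params - lr * g, step_time

if __name__ == "__main__":
    n_workers, dim = 32, 10
    params = np.zeros(dim)
    grads = rng.normal(size=(n_workers, dim))   # placeholder gradients
    times = simulate_worker_times(n_workers)
    # A fixed empirical quantile stands in for BDSGD's predicted cutoff.
    cutoff = np.quantile(times, 0.9)
    params, step_time = cutoff_sgd_step(params, grads, times, cutoff)
    print(f"used {int((times <= cutoff).sum())}/{n_workers} workers, "
          f"step took {step_time:.2f}s instead of {times.max():.2f}s")
```

In this toy setting the step's wall-clock time drops to that of the slowest non-straggling worker, at the cost of averaging slightly fewer gradients; the paper's contribution is choosing that cutoff from predicted run-times rather than a static rule.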

Cite

Text

Teng and Wood. "Bayesian Distributed Stochastic Gradient Descent." Neural Information Processing Systems, 2018.

Markdown

[Teng and Wood. "Bayesian Distributed Stochastic Gradient Descent." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/teng2018neurips-bayesian/)

BibTeX

@inproceedings{teng2018neurips-bayesian,
  title     = {{Bayesian Distributed Stochastic Gradient Descent}},
  author    = {Teng, Michael and Wood, Frank},
  booktitle = {Neural Information Processing Systems},
  year      = {2018},
  pages     = {6378--6388},
  url       = {https://mlanthology.org/neurips/2018/teng2018neurips-bayesian/}
}