Bayesian Distributed Stochastic Gradient Descent
Abstract
We introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-throughput algorithm for training deep neural networks on parallel clusters. This algorithm uses amortized inference in a deep generative model to perform joint posterior predictive inference of mini-batch gradient computation times in a compute-cluster-specific manner. Specifically, our algorithm mitigates the straggler effect in synchronous, gradient-based optimization by choosing an optimal cutoff beyond which mini-batch gradient messages from slow workers are ignored. In our experiments, we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but also accelerates the overall rate of convergence as a function of wall-clock time by eliminating idleness. The principal novel contribution and finding of this work go beyond this: using run-times predicted by a generative model of cluster worker performance improves substantially over the static-cutoff prior art, reducing deep neural network training times on large compute clusters.
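The sketch below is a minimal illustration, not the authors' implementation, of the cutoff idea the abstract describes: given samples of predicted per-worker mini-batch gradient computation times (which in BDSGD would come from the posterior predictive of the deep generative model), choose a cutoff that maximizes expected useful work per unit wall-clock time, then average only the gradients that arrive before it. All function names, the throughput criterion, and the simulated run-time distribution are assumptions made for illustration.

```python
# Illustrative sketch of straggler-cutoff aggregation (assumed, simplified form).
import numpy as np


def choose_cutoff(predicted_times: np.ndarray, grid_size: int = 200) -> float:
    """Pick the cutoff maximizing (expected # workers finished) / cutoff.

    predicted_times: array of shape (num_samples, num_workers) holding sampled
    per-worker completion times in seconds (stand-in for posterior predictive
    samples of cluster performance).
    """
    candidates = np.linspace(predicted_times.min(), predicted_times.max(), grid_size)
    # Expected number of workers finishing within each candidate cutoff.
    expected_done = (predicted_times[:, :, None] <= candidates).mean(axis=0).sum(axis=0)
    throughput = expected_done / candidates  # gradients contributed per second
    return float(candidates[np.argmax(throughput)])


def aggregate_gradients(gradients, actual_times, cutoff):
    """Average only the mini-batch gradients whose workers beat the cutoff."""
    kept = [g for g, t in zip(gradients, actual_times) if t <= cutoff]
    if not kept:  # degenerate case: fall back to the single fastest worker
        kept = [gradients[int(np.argmin(actual_times))]]
    return np.mean(kept, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_workers, dim = 32, 10
    # Simulated predictive samples: most workers fast, a few straggling.
    predicted = rng.gamma(shape=4.0, scale=0.5, size=(1000, num_workers))
    predicted[:, :3] *= 3.0  # three chronically slow workers
    cutoff = choose_cutoff(predicted)

    actual = rng.gamma(shape=4.0, scale=0.5, size=num_workers)
    actual[:3] *= 3.0
    grads = [rng.normal(size=dim) for _ in range(num_workers)]
    step = aggregate_gradients(grads, actual, cutoff)
    print(f"cutoff={cutoff:.2f}s, used {np.sum(actual <= cutoff)}/{num_workers} workers")
```

In this toy version the cutoff trades off waiting longer (more gradients per step) against stepping sooner (more steps per second); the paper's contribution is to drive that choice with cluster-specific predicted run-times rather than a static rule.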
Cite
Text
Teng and Wood. "Bayesian Distributed Stochastic Gradient Descent." Neural Information Processing Systems, 2018.
Markdown
[Teng and Wood. "Bayesian Distributed Stochastic Gradient Descent." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/teng2018neurips-bayesian/)
BibTeX
@inproceedings{teng2018neurips-bayesian,
title = {{Bayesian Distributed Stochastic Gradient Descent}},
author = {Teng, Michael and Wood, Frank},
booktitle = {Neural Information Processing Systems},
year = {2018},
pages = {6378--6388},
url = {https://mlanthology.org/neurips/2018/teng2018neurips-bayesian/}
}