Gradient Coding: Avoiding Stragglers in Distributed Learning

Abstract

We propose a novel coding-theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our schemes in Python (using MPI) to run on Amazon EC2, and compare them against baseline approaches in terms of running time and generalization error.
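To make the idea concrete, here is a minimal sketch of the kind of scheme the abstract describes, for 3 workers tolerating 1 straggler: each worker stores 2 of 3 data partitions and sends one coded combination of its partition gradients, and the master can recover the full gradient sum from any 2 workers. The partition gradients below are stand-in random vectors; the specific coding coefficients are one valid construction and are illustrative, not pulled verbatim from the paper's implementation.

```python
import numpy as np

# Stand-in "partition gradients": in a real run, g1, g2, g3 would each be
# the gradient of the loss over one of 3 data partitions.
rng = np.random.default_rng(0)
g1, g2, g3 = (rng.standard_normal(4) for _ in range(3))
full_gradient = g1 + g2 + g3

# Each of the 3 workers holds 2 of the 3 partitions (replication factor 2)
# and sends a single coded linear combination of its partition gradients.
worker_msgs = {
    1: 0.5 * g1 + g2,   # worker 1 holds partitions 1 and 2
    2: g2 - g3,         # worker 2 holds partitions 2 and 3
    3: 0.5 * g1 + g3,   # worker 3 holds partitions 1 and 3
}

# Decoding table: for every set of 2 surviving (non-straggler) workers,
# a fixed linear combination of their messages equals g1 + g2 + g3.
decode = {
    frozenset({1, 2}): {1: 2.0, 2: -1.0},
    frozenset({1, 3}): {1: 1.0, 3: 1.0},
    frozenset({2, 3}): {2: 1.0, 3: 2.0},
}

# No matter which single worker straggles, the master recovers the
# exact full gradient from the other two.
for survivors, coeffs in decode.items():
    recovered = sum(c * worker_msgs[w] for w, c in coeffs.items())
    assert np.allclose(recovered, full_gradient)
```

The key property is that the decoding is exact, not approximate: synchronous gradient descent proceeds as if all workers had responded, while the master only waits for any 2 of the 3.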

Cite

Text

Tandon et al. "Gradient Coding: Avoiding Stragglers in Distributed Learning." International Conference on Machine Learning, 2017.

Markdown

[Tandon et al. "Gradient Coding: Avoiding Stragglers in Distributed Learning." International Conference on Machine Learning, 2017.](https://mlanthology.org/icml/2017/tandon2017icml-gradient/)

BibTeX

@inproceedings{tandon2017icml-gradient,
  title     = {{Gradient Coding: Avoiding Stragglers in Distributed Learning}},
  author    = {Tandon, Rashish and Lei, Qi and Dimakis, Alexandros G. and Karampatziakis, Nikos},
  booktitle = {International Conference on Machine Learning},
  year      = {2017},
  pages     = {3368--3376},
  volume    = {70},
  url       = {https://mlanthology.org/icml/2017/tandon2017icml-gradient/}
}