Zeno: Distributed Stochastic Gradient Descent with Suspicion-Based Fault-Tolerance

Abstract

We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.
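The abstract names the mechanism (suspect potentially defective workers, then rank them) without spelling it out, so the following is a minimal sketch of one way such ranking-based aggregation could look. It assumes the server can evaluate a loss estimate itself, and it scores each candidate update by the estimated loss decrease minus a magnitude penalty. The scoring rule and all names (`score`, `rank_and_average`, `gamma`, `rho`, `n_trim`) are illustrative assumptions, not details given in the abstract.

```python
# Hedged sketch of suspicion-based, ranking-based aggregation.
# Every name and the exact scoring rule are hypothetical illustrations.

def score(loss, x, u, gamma, rho):
    """Estimated loss decrease from stepping along -u, minus a magnitude penalty."""
    stepped = [xi - gamma * ui for xi, ui in zip(x, u)]
    return loss(x) - loss(stepped) - rho * sum(ui * ui for ui in u)


def rank_and_average(loss, x, candidates, gamma, rho, n_trim):
    """Drop the n_trim lowest-scoring candidates and average the rest."""
    if not 0 <= n_trim < len(candidates):
        raise ValueError("n_trim must leave at least one candidate")
    ranked = sorted(
        candidates,
        key=lambda u: score(loss, x, u, gamma, rho),
        reverse=True,  # most trusted (highest score) first
    )
    kept = ranked[: len(candidates) - n_trim]
    return [sum(u[i] for u in kept) / len(kept) for i in range(len(x))]


# Toy problem (assumed for illustration): loss 0.5 * ||x||^2, whose gradient is x.
def toy_loss(x):
    return 0.5 * sum(xi * xi for xi in x)


x = [1.0, -2.0]
honest = [1.0, -2.0]      # the true gradient at x
faulty = [-10.0, 20.0]    # a large update pointing the wrong way

# Three of four workers are faulty; only one is honest.
candidates = [faulty, honest, faulty, faulty]
update = rank_and_average(toy_loss, x, candidates, gamma=0.1, rho=0.001, n_trim=3)
```

In this toy setup the honest update scores highest because it is the only one that lowers the loss estimate, so it survives the ranking even though faulty workers form the majority. A coordinate-wise median or a majority vote would follow the faulty updates in the same situation.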

Cite

Text

Xie et al. "Zeno: Distributed Stochastic Gradient Descent with Suspicion-Based Fault-Tolerance." International Conference on Machine Learning, 2019.

Markdown

[Xie et al. "Zeno: Distributed Stochastic Gradient Descent with Suspicion-Based Fault-Tolerance." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/xie2019icml-zeno/)

BibTeX

@inproceedings{xie2019icml-zeno,
  title     = {{Zeno: Distributed Stochastic Gradient Descent with Suspicion-Based Fault-Tolerance}},
  author    = {Xie, Cong and Koyejo, Sanmi and Gupta, Indranil},
  booktitle = {International Conference on Machine Learning},
  year      = {2019},
  pages     = {6893-6901},
  volume    = {97},
  url       = {https://mlanthology.org/icml/2019/xie2019icml-zeno/}
}