Decentralized Learning: Theoretical Optimality and Practical Improvements

Abstract

Decentralization is a promising method for scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity of such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in the known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction that this lower bound is tight and achievable. Motivated by these insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound up to only a logarithmic gap. While a simple version of DeTAG with plain SGD and a constant step size suffices to achieve the theoretical limit, we additionally provide a convergence bound for DeTAG under a general non-increasing step size and momentum. Empirically, we compare DeTAG with other decentralized algorithms on multiple vision benchmarks, including CIFAR10/100 and ImageNet. We substantiate our theory and show that DeTAG converges faster on unshuffled data and in sparse networks. Furthermore, we study a DeTAG variant, DeTAG*, that practically speeds up data-center-scale model training. This manuscript extends its ICML version with additional content.
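For readers unfamiliar with gossip-style decentralized training, the sketch below illustrates the generic template such methods build on: each worker averages its model with its graph neighbors via a doubly-stochastic mixing matrix, then applies a local stochastic gradient step. This is only a minimal D-PSGD-style illustration with assumed names (`gossip_sgd_round`, a toy ring mixing matrix `W`); DeTAG's accelerated gossip and gradient-tracking components are described in the paper and are not reproduced here.

```python
# Minimal sketch of one round of gossip-style decentralized SGD
# (generic D-PSGD-style template; NOT the exact DeTAG algorithm).
import numpy as np

def gossip_sgd_round(params, grads, W, lr=0.1):
    """One synchronous round for n workers.

    params: (n, d) array; row i is worker i's model copy
    grads:  (n, d) array; row i is worker i's stochastic gradient
    W:      (n, n) doubly-stochastic mixing matrix matching the
            communication graph (W[i, j] > 0 only if i and j are neighbors)
    """
    # Gossip step: each worker averages with its neighbors...
    mixed = W @ params
    # ...then applies its own local stochastic gradient.
    return mixed - lr * grads

# Toy usage: 4 workers on a ring topology, 3-dimensional model.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
params = np.random.randn(4, 3)
grads = np.random.randn(4, 3)  # placeholder for real stochastic gradients
params = gossip_sgd_round(params, grads, W)
```

As the mixing matrix becomes sparser (fewer neighbors per worker), consensus among the local models spreads more slowly, which is the regime where the paper's lower bound and DeTAG's improvements are most relevant.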

Cite

Text

Lu and De Sa. "Decentralized Learning: Theoretical Optimality and Practical Improvements." Journal of Machine Learning Research, 2023.

Markdown

[Lu and De Sa. "Decentralized Learning: Theoretical Optimality and Practical Improvements." Journal of Machine Learning Research, 2023.](https://mlanthology.org/jmlr/2023/lu2023jmlr-decentralized/)

BibTeX

@article{lu2023jmlr-decentralized,
  title     = {{Decentralized Learning: Theoretical Optimality and Practical Improvements}},
  author    = {Lu, Yucheng and De Sa, Christopher},
  journal   = {Journal of Machine Learning Research},
  year      = {2023},
  pages     = {1--62},
  volume    = {24},
  url       = {https://mlanthology.org/jmlr/2023/lu2023jmlr-decentralized/}
}