Decentralized SGD and Average-Direction SAM Are Asymptotically Equivalent

Abstract

Decentralized stochastic gradient descent (D-SGD) allows collaborative learning across massive numbers of devices simultaneously, without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction sharpness-aware minimization (SAM) algorithm under general non-convex, non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.

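To make the stated equivalence concrete, the following is a minimal sketch of the two objects the abstract compares, written in standard D-SGD notation (mixing matrix $W = [W_{ij}]$, local losses $f_j$, mini-batch samples $\xi_j^{(t)}$, learning rate $\eta$). The precise perturbation distribution is derived in the paper; it is only indicated schematically here as $\mathcal{D}$.

D-SGD update on worker $i$ (local gradient step followed by gossip averaging with neighbors):
$$\mathbf{w}_i^{(t+1)} \;=\; \sum_{j=1}^{m} W_{ij}\Big(\mathbf{w}_j^{(t)} - \eta\, \nabla f_j\big(\mathbf{w}_j^{(t)};\, \xi_j^{(t)}\big)\Big).$$

Average-direction SAM objective, where the perturbation is averaged over directions rather than maximized:
$$\min_{\mathbf{w}} \; \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{D}}\big[\, f(\mathbf{w} + \boldsymbol{\delta})\,\big].$$

The abstract's claim is that, asymptotically, D-SGD implicitly minimizes an objective of this averaged-perturbation form, in contrast to the worst-case perturbation $\max_{\|\boldsymbol{\epsilon}\|\le\rho} f(\mathbf{w}+\boldsymbol{\epsilon})$ used by standard SAM.
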
Cite

Text

Zhu et al. "Decentralized SGD and Average-Direction SAM Are Asymptotically Equivalent." International Conference on Machine Learning, 2023.

Markdown

[Zhu et al. "Decentralized SGD and Average-Direction SAM Are Asymptotically Equivalent." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/zhu2023icml-decentralized/)

BibTeX

@inproceedings{zhu2023icml-decentralized,
  title     = {{Decentralized SGD and Average-Direction SAM Are Asymptotically Equivalent}},
  author    = {Zhu, Tongtian and He, Fengxiang and Chen, Kaixuan and Song, Mingli and Tao, Dacheng},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {43005--43036},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/zhu2023icml-decentralized/}
}