ResIST: Layer-Wise Decomposition of ResNets for Distributed Training

Abstract

We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of those of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.
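The round structure described in the abstract (random layer-wise decomposition, independent local training, synchronization and aggregation) can be sketched in a few lines of Python. The snippet below is a simplified illustration, not the authors' implementation: parameters are plain NumPy arrays per residual block, each block is assigned to one randomly chosen sub-ResNet per round (in general a block may be shared by several sub-ResNets and its copies averaged), and names such as partition_blocks, resist_round, and local_train are hypothetical.

# Minimal sketch of one ResIST round, assuming block parameters are NumPy arrays.
import random
import numpy as np

def partition_blocks(num_blocks, num_subnets, seed):
    """Randomly assign each residual block index to a sub-ResNet for this round."""
    rng = random.Random(seed)
    assignment = {s: [] for s in range(num_subnets)}
    for b in range(num_blocks):
        assignment[rng.randrange(num_subnets)].append(b)
    return assignment

def resist_round(global_blocks, num_subnets, local_iters, seed, local_train):
    """One round: scatter shallow sub-ResNets, train locally, aggregate updates."""
    assignment = partition_blocks(len(global_blocks), num_subnets, seed)
    updated = {}
    for s, block_ids in assignment.items():
        # Each worker receives only its subset of blocks (reduced communication),
        # so no machine ever holds the full model during training.
        sub = {b: global_blocks[b].copy() for b in block_ids}
        updated[s] = local_train(sub, local_iters)  # independent local SGD steps
    # Aggregation: copy each block back; average copies if a block was shared.
    for b in range(len(global_blocks)):
        owners = [updated[s][b] for s in updated if b in updated[s]]
        if owners:
            global_blocks[b] = np.mean(owners, axis=0)
    return global_blocks

A full training loop would simply call resist_round repeatedly, drawing a fresh random partition (new seed) each round until convergence, which mirrors the repeated regeneration of sub-ResNets described above.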

Cite

Text

Dun et al. "ResIST: Layer-Wise Decomposition of ResNets for Distributed Training." Uncertainty in Artificial Intelligence, 2022.

Markdown

[Dun et al. "ResIST: Layer-Wise Decomposition of ResNets for Distributed Training." Uncertainty in Artificial Intelligence, 2022.](https://mlanthology.org/uai/2022/dun2022uai-resist/)

BibTeX

@inproceedings{dun2022uai-resist,
  title     = {{ResIST: Layer-Wise Decomposition of ResNets for Distributed Training}},
  author    = {Dun, Chen and Wolfe, Cameron R. and Jermaine, Christopher M. and Kyrillidis, Anastasios},
  booktitle = {Uncertainty in Artificial Intelligence},
  year      = {2022},
  pages     = {610--620},
  volume    = {180},
  url       = {https://mlanthology.org/uai/2022/dun2022uai-resist/}
}