Enhancing Stability for Large Models Training in Constrained Bandwidth Networks

Abstract

Training extremely large language models with billions of parameters is a computationally intensive task that pushes the limits of current data-parallel training systems. While techniques like ZeRO++ (Wang et al., 2024) have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme employed to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm's ability to achieve reliable convergence on these massive models, where stock ZeRO++ hpZ fails to converge. The updated algorithm enables robust training of larger models with 98% throughput and model training speed improvement, without sacrificing the quality of convergence.
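
To make the failure mode concrete, the sketch below illustrates, in plain PyTorch rather than DeepSpeed's actual hpZ implementation, how a node-local "secondary partition" copy issued on a side CUDA stream can race with an in-flight asynchronous all-gather, and how explicit stream ordering removes the race. The function and variable names (`gather_then_repartition`, `copy_stream`) are hypothetical and used only for illustration; this is not the paper's proposed algorithm.

```python
# Hypothetical sketch of an all-gather / secondary-partition race and its fix.
# Assumes torch.distributed is already initialized with a CUDA (NCCL) backend.
import torch
import torch.distributed as dist


def gather_then_repartition(shard: torch.Tensor,
                            copy_stream: torch.cuda.Stream) -> torch.Tensor:
    """All-gather a parameter shard across ranks, then copy one slice of the
    gathered tensor into a node-local secondary buffer on a side stream."""
    world_size = dist.get_world_size()
    main_stream = torch.cuda.current_stream()

    full = torch.empty(shard.numel() * world_size,
                       dtype=shard.dtype, device=shard.device)
    # Asynchronous all-gather enqueued from the main stream.
    work = dist.all_gather_into_tensor(full, shard, async_op=True)

    secondary = torch.empty_like(shard)
    full.record_stream(copy_stream)  # keep `full` alive for the side stream
    with torch.cuda.stream(copy_stream):
        # A racy version would issue the copy immediately; the gathered buffer
        # may not be fully written yet. The two ordering calls below make the
        # side stream wait for the collective and for prior main-stream work.
        work.wait()                          # stream-level wait on the collective
        copy_stream.wait_stream(main_stream)
        secondary.copy_(full[:shard.numel()], non_blocking=True)

    # Later consumers on the main stream must also see the finished copy.
    main_stream.wait_stream(copy_stream)
    return secondary
```

Without the explicit waits, the copy can read stale or partially written memory on some ranks, which silently corrupts the secondary partition and surfaces as the non-deterministic divergence described in the abstract.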

Cite

Text

Dai et al. "Enhancing Stability for Large Models Training in Constrained Bandwidth Networks." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Dai et al. "Enhancing Stability for Large Models Training in Constrained Bandwidth Networks." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/dai2024icmlw-enhancing/)

BibTeX

@inproceedings{dai2024icmlw-enhancing,
  title     = {{Enhancing Stability for Large Models Training in Constrained Bandwidth Networks}},
  author    = {Dai, Yun and Dharamsi, Tejas and Hsu, Pin-Lun and Song, Tao and Firooz, Hamed},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/dai2024icmlw-enhancing/}
}