SLAMB: Accelerated Large Batch Training with Sparse Communication
Abstract
Distributed training of large deep neural networks requires frequent exchange of massive data between machines, so communication efficiency is a major concern. Existing compressed communication methods are either not compatible with large batch optimization algorithms, or do not provide sufficient speedup at large scale. In this paper, we combine sparsification-based gradient compression with the layer-wise adaptive moments optimizer for large batch training (LAMB). We propose SLAMB, a novel communication-efficient optimizer that supports large batch sizes and scales to thousands of GPUs. SLAMB employs momentum masking, local error compensation, and element-wise adaptive rescaling to achieve accurate layer-wise weight updates, which translates to fast convergence for very large batches. Our empirical results show that, compared to the state-of-the-art, SLAMB transmits half the amount of data in large-batch BERT pre-training, without sacrificing accuracy. Moreover, SLAMB achieves excellent scalability on large computing infrastructures. For instance, SLAMB with 128 GPUs reduces the training time of Swin Transformer pre-training on ImageNet to 5.35 hours, which is 2 hours faster than the state-of-the-art. At the extreme, we trained BERT-XL (2.8B parameters) on 1,024 NVIDIA A100 GPUs, where SLAMB achieved 90% scaling efficiency.
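To make the abstract's ingredients concrete, below is a minimal, hypothetical sketch of a per-layer update that combines top-k gradient sparsification with local error compensation, momentum masking, and a LAMB-style trust ratio. This is not the authors' implementation (it omits, e.g., the element-wise adaptive rescaling and the actual sparse communication step); all names such as `sparse_lamb_step` and `k_ratio` are assumptions for illustration only.

```python
# Illustrative sketch (NOT the paper's algorithm): one local step combining
# top-k sparsification, local error compensation, momentum masking, and a
# LAMB-style layer-wise trust ratio. Names and defaults are hypothetical.
import torch

def sparse_lamb_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-6, k_ratio=0.01):
    # Local error compensation: add back the residual left over from the
    # previous step's sparsification before selecting new coordinates.
    error = state.setdefault("error", torch.zeros_like(grad))
    corrected = grad + error

    # Top-k sparsification: keep only the largest-magnitude entries
    # (in a distributed run, only these would be communicated).
    k = max(1, int(k_ratio * corrected.numel()))
    flat = corrected.flatten()
    idx = flat.abs().topk(k).indices
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    mask = mask.view_as(corrected)
    sparse_grad = torch.where(mask, corrected, torch.zeros_like(corrected))
    state["error"] = corrected - sparse_grad  # residual stays local

    # Momentum masking: update the first moment only at transmitted
    # coordinates, so stale momentum does not leak into skipped entries.
    beta1, beta2 = betas
    m = state.setdefault("m", torch.zeros_like(grad))
    v = state.setdefault("v", torch.zeros_like(grad))
    m = torch.where(mask, beta1 * m + (1 - beta1) * sparse_grad, m)
    v = beta2 * v + (1 - beta2) * sparse_grad ** 2
    state["m"], state["v"] = m, v

    # LAMB-style layer-wise trust ratio: scale the Adam-like update by the
    # ratio of the parameter norm to the update norm.
    update = m / (v.sqrt() + eps)
    trust = param.norm() / (update.norm() + eps)
    param.add_(update, alpha=-lr * trust.clamp(max=10.0).item())

# Toy usage: one step on a random "layer".
p = torch.randn(512, 512)
g = torch.randn_like(p)
sparse_lamb_step(p, g, state={})
```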
Cite
Text
Xu et al. "SLAMB: Accelerated Large Batch Training with Sparse Communication." International Conference on Machine Learning, 2023.
Markdown
[Xu et al. "SLAMB: Accelerated Large Batch Training with Sparse Communication." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/xu2023icml-slamb/)
BibTeX
@inproceedings{xu2023icml-slamb,
  title = {{SLAMB: Accelerated Large Batch Training with Sparse Communication}},
  author = {Xu, Hang and Zhang, Wenxuan and Fei, Jiawei and Wu, Yuzhe and Xie, Tingwen and Huang, Jun and Xie, Yuchen and Elhoseiny, Mohamed and Kalnis, Panos},
  booktitle = {International Conference on Machine Learning},
  year = {2023},
  pages = {38801--38825},
  volume = {202},
  url = {https://mlanthology.org/icml/2023/xu2023icml-slamb/}
}