Maximizing Communication Efficiency for Large-Scale Training via 0/1 Adam
Abstract
1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question for Adam-based large-model pre-training (e.g., BERT and GPT). In this paper, we demonstrate that the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are applied individually. To alleviate this limitation, we propose **0/1 Adam**, which linearizes each Adam step by approximating its optimizer states using their stale estimates and linear correlation. **0/1 Adam** performs an Adam-like step to preserve adaptivity, while its linearity allows utilizing 1-bit compression and local steps simultaneously for wall-clock time speedup. We provide a convergence guarantee for **0/1 Adam** on smooth non-convex objectives. On various large-scale benchmarks such as BERT-Base, BERT-Large, and GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that **0/1 Adam** reduces data volume by up to 87% and communication rounds by up to 54%, and achieves up to 2× higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam, while enjoying the same statistical convergence speed and end-task model accuracy on the GLUE dataset and the ImageNet validation set.
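To make the core idea concrete, below is a minimal conceptual sketch (not the paper's actual algorithm or interface) of how freezing the variance state to a stale estimate makes the Adam-like update linear in the gradient, which in turn lets 1-bit compression with error feedback and local (communication-skipping) steps be applied together. The hyperparameters `refresh_interval` and `local_interval` are illustrative assumptions.

```python
# Conceptual sketch, assuming a PyTorch distributed setup. The variance
# state v is refreshed only occasionally (a "stale estimate"), so between
# refreshes the step is linear in the gradient, allowing 1-bit-style
# compression and skipped communication rounds to compose.
import torch
import torch.distributed as dist


def adam_like_step_with_stale_variance(p, grad, state, lr,
                                       beta1=0.9, beta2=0.999, eps=1e-8,
                                       refresh_interval=128, local_interval=4):
    t = state["t"] = state.get("t", 0) + 1
    m = state.setdefault("m", torch.zeros_like(p))    # first moment
    v = state.setdefault("v", torch.zeros_like(p))    # (stale) second moment
    err = state.setdefault("err", torch.zeros_like(p))  # error-feedback buffer

    # Local steps: synchronize a 1-bit-compressed gradient only occasionally.
    if dist.is_initialized() and t % local_interval == 0:
        buf = grad + err
        compressed = buf.sign() * buf.abs().mean()    # sign + scale (1-bit style)
        state["err"] = buf - compressed               # keep compression error
        dist.all_reduce(compressed)
        grad = compressed / dist.get_world_size()

    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    if t == 1 or t % refresh_interval == 0:
        # Refresh the variance estimate; otherwise reuse the stale value so
        # the update below stays linear in the incoming gradient.
        v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    p.add_(m / (v.sqrt() + eps), alpha=-lr)
```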
Cite
Text
Lu et al. "Maximizing Communication Efficiency for Large-Scale Training via 0/1 Adam." International Conference on Learning Representations, 2023.
Markdown
[Lu et al. "Maximizing Communication Efficiency for Large-Scale Training via 0/1 Adam." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/lu2023iclr-maximizing/)
BibTeX
@inproceedings{lu2023iclr-maximizing,
title = {{Maximizing Communication Efficiency for Large-Scale Training via 0/1 Adam}},
author = {Lu, Yucheng and Li, Conglong and Zhang, Minjia and De Sa, Christopher and He, Yuxiong},
booktitle = {International Conference on Learning Representations},
year = {2023},
url = {https://mlanthology.org/iclr/2023/lu2023iclr-maximizing/}
}