Quantized Distributed Training of Large Models with Convergence Guarantees

Abstract

Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model’s weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights, and thus the domain over which we train consists of quantized points and is, therefore, highly non-convex. We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
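The abstract's core algorithmic idea, keeping the model weights themselves on a quantization grid between SGD steps and quantizing gradients before they are communicated, can be illustrated with a minimal sketch. The code below is not the authors' implementation: it uses a standard unbiased stochastic-rounding quantizer, a simple mean as a stand-in for FSDP's reduce-scatter/all-gather communication, and illustrative function names, bit-widths, and hyperparameters.

```python
# Minimal sketch of a QSDP-style step (assumptions: unbiased stochastic
# rounding, per-tensor scaling, and a mean over workers standing in for
# the sharded communication collectives). Names are illustrative.
import numpy as np

def stochastic_quantize(x, num_bits=8):
    """Unbiased stochastic quantization of x onto a uniform grid.

    E[stochastic_quantize(x)] == x, which is the unbiasedness property that
    convergence analyses of quantized SGD-style methods typically rely on.
    """
    levels = 2 ** num_bits - 1
    scale = np.max(np.abs(x)) + 1e-12            # per-tensor scale
    normalized = np.abs(x) / scale * levels      # values in [0, levels]
    lower = np.floor(normalized)
    prob_up = normalized - lower                 # round up with this probability
    rounded = lower + (np.random.rand(*x.shape) < prob_up)
    return np.sign(x) * rounded / levels * scale

def qsdp_style_step(w_quant, grad_fn, workers=4, lr=0.1, num_bits=8):
    """One data-parallel step in which only quantized weights are maintained.

    w_quant : current (already quantized) weights shared by all workers
    grad_fn : grad_fn(w, worker_id) -> stochastic gradient on that worker's data
    """
    # Each worker computes a gradient at the *quantized* weights and
    # quantizes it before communication (mean = stand-in for reduce-scatter).
    grads = [stochastic_quantize(grad_fn(w_quant, i), num_bits)
             for i in range(workers)]
    g = np.mean(grads, axis=0)

    # SGD update followed by re-quantization, so the iterate stays on the
    # quantization grid -- the quantized (non-convex) domain the abstract refers to.
    return stochastic_quantize(w_quant - lr * g, num_bits)

if __name__ == "__main__":
    # Toy quadratic objective f(w) = 0.5 * ||w - w_star||^2 with noisy gradients.
    rng = np.random.default_rng(0)
    w_star = rng.normal(size=16)
    grad_fn = lambda w, i: (w - w_star) + 0.01 * rng.normal(size=w.shape)

    w = stochastic_quantize(rng.normal(size=16))
    for _ in range(200):
        w = qsdp_style_step(w, grad_fn)
    print("final error:", np.linalg.norm(w - w_star))
```

In this sketch, both the communicated gradients and the stored weights are quantized, so workers never need to exchange full-precision tensors; the trade-off is a residual error floor set by the quantization noise, which the paper's analysis bounds.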

Cite

Text

Markov et al. "Quantized Distributed Training of Large Models with Convergence Guarantees." International Conference on Machine Learning, 2023.

Markdown

[Markov et al. "Quantized Distributed Training of Large Models with Convergence Guarantees." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/markov2023icml-quantized/)

BibTeX

@inproceedings{markov2023icml-quantized,
  title     = {{Quantized Distributed Training of Large Models with Convergence Guarantees}},
  author    = {Markov, Ilia and Vladu, Adrian and Guo, Qi and Alistarh, Dan},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {24020--24044},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/markov2023icml-quantized/}
}