TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Abstract

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, it is possible to perform pipeline parallelism within a single training sequence. This enables a finer-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming-based algorithm that computes the optimal pipelining execution scheme for a given model and cluster configuration. We show that TeraPipe speeds up training of the largest GPT-3 model, with 175 billion parameters, by 5.0x on an AWS cluster of 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproducing our results is available at https://github.com/zhuohan123/terapipe.
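To give a flavor of the dynamic-programming search over slice sizes described above, the Python sketch below finds a partition of a training sequence into token slices under a simplified latency model. The cost function slice_time and the latency approximation (total slice time on one stage plus a fill/drain term proportional to the largest slice) are illustrative assumptions for this sketch, not the paper's measured costs or exact formulation.

# Minimal sketch: dynamic-programming search for sequence slice sizes.
# The cost model (slice_time) and the latency formula are illustrative
# assumptions, not TeraPipe's profiled costs.

def slice_time(num_tokens: int) -> float:
    """Hypothetical time to process one slice on one pipeline stage.
    In practice this would be profiled on the target hardware."""
    return 0.5 + 0.01 * num_tokens  # fixed overhead + per-token cost


def optimal_slicing(seq_len: int, num_stages: int, max_slice: int = 64):
    """Pick slice sizes minimizing an approximate pipeline latency:
    (total slice time on one stage) + (num_stages - 1) * (largest slice
    time), where the second term bounds the pipeline fill/drain cost."""
    candidate_caps = sorted({slice_time(s) for s in range(1, max_slice + 1)})
    best = None  # (latency, slice_sizes)

    for cap in candidate_caps:
        # DP over prefix lengths: min_sum[i] is the minimal total slice time
        # covering the first i tokens using only slices with time <= cap.
        INF = float("inf")
        min_sum = [INF] * (seq_len + 1)
        choice = [0] * (seq_len + 1)
        min_sum[0] = 0.0
        for i in range(1, seq_len + 1):
            for s in range(1, min(max_slice, i) + 1):
                t = slice_time(s)
                if t > cap:
                    break  # slice_time grows with s under this toy model
                if min_sum[i - s] + t < min_sum[i]:
                    min_sum[i] = min_sum[i - s] + t
                    choice[i] = s
        if min_sum[seq_len] == INF:
            continue

        latency = min_sum[seq_len] + (num_stages - 1) * cap
        if best is None or latency < best[0]:
            # Reconstruct the chosen slice sizes from the DP table.
            sizes, i = [], seq_len
            while i > 0:
                sizes.append(choice[i])
                i -= choice[i]
            best = (latency, tuple(reversed(sizes)))

    return best


if __name__ == "__main__":
    latency, sizes = optimal_slicing(seq_len=256, num_stages=8)
    print("slice sizes:", sizes)
    print("estimated latency:", round(latency, 2))

The key design point this sketch illustrates is the trade-off token-level pipelining must balance: smaller slices reduce the pipeline fill/drain time but pay more per-token overhead, so the best slicing depends on the model and cluster configuration.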

Cite

Text

Li et al. "TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models." International Conference on Machine Learning, 2021.

Markdown

[Li et al. "TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/li2021icml-terapipe/)

BibTeX

@inproceedings{li2021icml-terapipe,
  title     = {{TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models}},
  author    = {Li, Zhuohan and Zhuang, Siyuan and Guo, Shiyuan and Zhuo, Danyang and Zhang, Hao and Song, Dawn and Stoica, Ion},
  booktitle = {International Conference on Machine Learning},
  year      = {2021},
  pages     = {6543--6552},
  volume    = {139},
  url       = {https://mlanthology.org/icml/2021/li2021icml-terapipe/}
}