Breadth-First Pipeline Parallelism

Abstract

We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed increases of up to 53% in training speed.
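
To make the schedule contrast concrete, below is a minimal, hypothetical Python sketch of the forward-pass ordering on a single GPU that holds several pipeline stages under a looped placement: a depth-first order pushes each micro-batch through all of its local stages before starting the next micro-batch, while a breadth-first order runs every micro-batch through one stage before advancing to the next. The stage placement, micro-batch count, and helper names (depth_first_order, breadth_first_order) are illustrative assumptions, not taken from the paper.

# Hypothetical sketch: contrast depth-first and breadth-first micro-batch
# orderings for one GPU that holds two non-adjacent pipeline stages.
# Stage indices and micro-batch count are assumed for illustration.
from itertools import product

LOCAL_STAGES = [0, 4]      # pipeline stages assigned to this GPU (assumed looped placement)
MICRO_BATCHES = range(4)   # micro-batches in the current batch (assumed count)

def depth_first_order():
    # Each micro-batch runs through every local stage before the next micro-batch starts.
    return [(mb, stage) for mb, stage in product(MICRO_BATCHES, LOCAL_STAGES)]

def breadth_first_order():
    # Every micro-batch runs through a stage before the GPU moves to its next local stage,
    # so per-stage (sharded) weights stay resident while all micro-batches use them.
    return [(mb, stage) for stage, mb in product(LOCAL_STAGES, MICRO_BATCHES)]

if __name__ == "__main__":
    print("depth-first  :", depth_first_order())
    print("breadth-first:", breadth_first_order())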

Cite

Text

Lamy-Poirier. "Breadth-First Pipeline Parallelism." NeurIPS 2022 Workshops: HITY, 2022.

Markdown

[Lamy-Poirier. "Breadth-First Pipeline Parallelism." NeurIPS 2022 Workshops: HITY, 2022.](https://mlanthology.org/neuripsw/2022/lamypoirier2022neuripsw-breadthfirst/)

BibTeX

@inproceedings{lamypoirier2022neuripsw-breadthfirst,
  title     = {{Breadth-First Pipeline Parallelism}},
  author    = {Lamy-Poirier, Joel},
  booktitle = {NeurIPS 2022 Workshops: HITY},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/lamypoirier2022neuripsw-breadthfirst/}
}