Beyond Top-K: Structured Sparsification for Compression in Pipeline Parallel

Abstract

In decentralized training, efficient communication is critical, particularly when training large-scale models over low-bandwidth, heterogeneous networks. Although gradient compression techniques have proven effective in Distributed Data-Parallel (DDP) settings, extending them to pipeline parallel (PP) training is challenging because compression errors accumulate and compound with network depth. In this work, we introduce a novel compression framework for PP that preserves the column space of activations and gradients instead of compressing individual elements. We derive tight theoretical error bounds and demonstrate the effectiveness of our method by training models over 80 Mbps connections, achieving up to 90% compression together with roughly $2\times$ training and $12\times$ inference throughput improvements.
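The abstract describes the method only at a high level, so the sketch below is an illustrative, hypothetical interpretation: a column-space-preserving compressor built from a truncated SVD, contrasted with element-wise Top-K. The function names and the choice of SVD are assumptions for illustration, not the authors' actual algorithm.

# Minimal sketch, assuming "preserve the column space" means transmitting a
# low-rank factorization (orthonormal basis + coefficients) instead of the
# k largest individual entries. Hypothetical; not the paper's implementation.
import numpy as np

def topk_compress(x: np.ndarray, k: int) -> np.ndarray:
    """Baseline element-wise Top-K: keep only the k largest-magnitude entries."""
    flat = x.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(x.shape)

def column_space_compress(x: np.ndarray, rank: int):
    """Hypothetical structured compressor: keep the top-`rank` left singular
    vectors so the column space of `x` is approximately preserved."""
    u, _, _ = np.linalg.svd(x, full_matrices=False)
    basis = u[:, :rank]        # orthonormal basis spanning the kept column space
    coeffs = basis.T @ x       # coordinates of x in that basis
    return basis, coeffs       # two small factors are sent across the pipeline link

def decompress(basis: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    return basis @ coeffs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Activations in deep networks are often approximately low-rank, which is
    # what makes a column-space compressor plausible.
    acts = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 256))
    approx = decompress(*column_space_compress(acts, rank=32))
    print("relative error:", np.linalg.norm(acts - approx) / np.linalg.norm(acts))

Under these assumptions, the two transmitted factors cost (512 + 256) x 32 numbers instead of 512 x 256, roughly an 80% reduction at rank 32; the paper's reported 90% compression presumably corresponds to its own, tighter construction.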

Cite

Text

Ramasinghe et al. "Beyond Top-K: Structured Sparsification for Compression in Pipeline Parallel." ICLR 2025 Workshops: MCDC, 2025.

Markdown

[Ramasinghe et al. "Beyond Top-K: Structured Sparsification for Compression in Pipeline Parallel." ICLR 2025 Workshops: MCDC, 2025.](https://mlanthology.org/iclrw/2025/ramasinghe2025iclrw-beyond/)

BibTeX

@inproceedings{ramasinghe2025iclrw-beyond,
  title     = {{Beyond Top-K: Structured Sparsification for Compression in Pipeline Parallel}},
  author    = {Ramasinghe, Sameera and Ajanthan, Thalaiyasingam and Avraham, Gil and Zuo, Yan and Long, Alexander},
  booktitle = {ICLR 2025 Workshops: MCDC},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/ramasinghe2025iclrw-beyond/}
}