Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training
Abstract
Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block, which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a compression rate of over 95% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300 Mbps, matching the wall-clock convergence speed of centralized models on 100 Gbps interconnects.
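To make the stated insight concrete, the following is a minimal illustrative sketch, not the authors' implementation: per-token activations are projected onto a small set of learned low-rank subspaces with a gating network mixing them, so only compact codes and gate weights (rather than full-width activations) would need to cross a slow link. All names and hyperparameters here (MixtureOfSubspaces, num_subspaces, rank) are assumptions for illustration, written in PyTorch.

# Sketch (assumed names/shapes): constrain activations to a learned mixture of
# low-rank subspaces; communicate codes + gates instead of raw activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSubspaces(nn.Module):
    def __init__(self, d_model: int, num_subspaces: int = 4, rank: int = 32):
        super().__init__()
        # One learned basis per subspace: maps d_model <-> rank.
        self.bases = nn.Parameter(torch.randn(num_subspaces, d_model, rank) / d_model**0.5)
        self.gate = nn.Linear(d_model, num_subspaces)

    def encode(self, x: torch.Tensor):
        # x: [tokens, d_model] -> codes: [tokens, K, rank], gates: [tokens, K]
        gates = F.softmax(self.gate(x), dim=-1)
        codes = torch.einsum("td,kdr->tkr", x, self.bases)
        return codes, gates

    def decode(self, codes: torch.Tensor, gates: torch.Tensor):
        # Gate-weighted sum of per-subspace back-projections: [tokens, d_model]
        recon = torch.einsum("tkr,kdr->tkd", codes, self.bases)
        return (gates.unsqueeze(-1) * recon).sum(dim=1)

if __name__ == "__main__":
    mos = MixtureOfSubspaces(d_model=1024, num_subspaces=4, rank=32)
    x = torch.randn(8, 1024)
    codes, gates = mos.encode(x)          # payload sent across the slow link
    x_hat = mos.decode(codes, gates)      # reconstruction on the receiving device
    print(x_hat.shape, codes.numel() + gates.numel(), "vs", x.numel())

In this toy configuration the transmitted payload per token is num_subspaces * rank + num_subspaces values versus d_model for the raw activation; the actual compression rate and reparameterization in the paper differ from this sketch.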
Cite
Text
Ramasinghe et al. "Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training." Advances in Neural Information Processing Systems, 2025.
Markdown
[Ramasinghe et al. "Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/ramasinghe2025neurips-mixtures/)
BibTeX
@inproceedings{ramasinghe2025neurips-mixtures,
title = {{Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training}},
author = {Ramasinghe, Sameera and Ajanthan, Thalaiyasingam and Dolatabadi, Hadi Mohaghegh and Avraham, Gil and Shevchenko, Violetta and Zuo, Yan and Koneputugodage, Chamin P Hewa and Long, Alexander},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/ramasinghe2025neurips-mixtures/}
}