Occult: Optimizing Collaborative Communications Across Experts for Accelerated Parallel MoE Training and Inference
Abstract
Mixture-of-experts (MoE) architectures can achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, this communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% of the runtime in large-scale training). In this paper, we first define $\textit{collaborative communication}$ to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them $\textit{collaborated}$, which comprises two cases, $\textit{intra-}$ and $\textit{inter-collaboration}$, depending on whether the two experts are kept on the same device. Our pilot investigations reveal that increasing the proportion of intra-collaboration can accelerate expert parallelism at scale. This motivates us to strategically $\underline{\texttt{o}}$ptimize $\underline{\texttt{c}}$ollaborative $\underline{\texttt{c}}$omm$\underline{\texttt{u}}$nication for acce$\underline{\texttt{l}}$era$\underline{\texttt{t}}$ed MoE training and inference, dubbed $\textbf{\texttt{Occult}}$. Our designs can $\underline{either}$ deliver exact results with reduced communication cost $\underline{or}$ controllably minimize the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that $\texttt{Occult}$ can be faster than popular state-of-the-art inference or training frameworks (over 50% speedup across multiple tasks and models) with comparable or superior quality relative to standard fine-tuning. Code will be available upon acceptance.
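To make the intra- vs. inter-collaboration distinction concrete, below is a minimal sketch (not the paper's code) of how the intra-collaboration proportion mentioned in the abstract could be measured: for each token, every pair of co-activated (top-k) experts counts as a collaboration, and it is intra-collaboration when both experts sit on the same device. The names `topk_experts` and `expert_to_device` are hypothetical, introduced only for illustration.

```python
# Sketch: fraction of expert collaborations that stay on one device,
# given per-token top-k routing decisions and an expert-to-device placement.
from itertools import combinations

def intra_collaboration_ratio(topk_experts, expert_to_device):
    """topk_experts: list of per-token lists of selected expert ids.
    expert_to_device: dict mapping expert id -> device id."""
    intra, total = 0, 0
    for experts in topk_experts:
        for e_i, e_j in combinations(experts, 2):  # co-activated expert pairs
            total += 1
            if expert_to_device[e_i] == expert_to_device[e_j]:
                intra += 1  # both experts placed on the same device
    return intra / total if total else 0.0

# Toy usage: 4 experts across 2 devices, top-2 routing for 3 tokens.
placement = {0: 0, 1: 0, 2: 1, 3: 1}
routing = [[0, 1], [0, 2], [2, 3]]
print(intra_collaboration_ratio(routing, placement))  # 2/3 of pairs are intra-device
```

Raising this ratio (e.g., by expert placement or collaboration pruning, as Occult proposes) reduces the share of token activations that must cross devices via all-to-all communication.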
Cite
Text

Luo et al. "Occult: Optimizing Collaborative Communications Across Experts for Accelerated Parallel MoE Training and Inference." Proceedings of the 42nd International Conference on Machine Learning, 2025.

BibTeX
@inproceedings{luo2025icml-occult,
  title     = {{Occult: Optimizing Collaborative Communications Across Experts for Accelerated Parallel MoE Training and Inference}},
  author    = {Luo, Shuqing and Li, Pingzhi and Peng, Jie and Zhao, Yang and Cao, Yu and Cheng, Yu and Chen, Tianlong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {41235--41253},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/luo2025icml-occult/}
}