DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression

Abstract

Sparse Mixture of Experts (SMoE) models have emerged as an efficient architecture for large language models. While recent community efforts have focused on merging multiple models to create SMoEs, deploying these merged models remains challenging due to their substantial memory requirements. In this paper, we present DeltaMoE, a training-free delta compression pipeline that enables efficient deployment of SMoE models through structured sparsity and quantization. Our evaluation shows that DeltaMoE achieves up to a $2.34\times$ compression ratio and a $2.57\times$ throughput improvement. DeltaMoE also scales with the number of experts, making it particularly suitable for large SMoE models.
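The sketch below illustrates the general idea of delta compression as described in the abstract: each merged expert is stored as a structured-sparse, quantized delta relative to a shared base model, and the full weight is reconstructed at load or inference time. It is a minimal illustration only; the 2:4 sparsity pattern, int8 quantization, and all function names are assumptions, not the paper's actual pipeline.

# Illustrative sketch of delta compression for a merged SMoE expert.
# Assumptions (not from the paper): 2:4 structured sparsity, symmetric
# per-tensor int8 quantization, NumPy arrays standing in for expert weights.
import numpy as np

def structured_2_4_sparsify(delta: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude values in every group of 4 (2:4 sparsity)."""
    d = delta.reshape(-1, 4).copy()
    # Indices of the 2 smallest-magnitude entries per group: zero them out.
    drop = np.argsort(np.abs(d), axis=1)[:, :2]
    np.put_along_axis(d, drop, 0.0, axis=1)
    return d.reshape(delta.shape)

def quantize_int8(delta: np.ndarray):
    """Symmetric per-tensor int8 quantization of the (sparse) delta."""
    scale = np.abs(delta).max() / 127.0 + 1e-12
    q = np.clip(np.round(delta / scale), -127, 127).astype(np.int8)
    return q, scale

def compress_expert(expert_w: np.ndarray, base_w: np.ndarray):
    # Delta between the merged expert and the shared base weights.
    delta = expert_w - base_w
    return quantize_int8(structured_2_4_sparsify(delta))

def reconstruct_expert(base_w: np.ndarray, q: np.ndarray, scale: float):
    # Approximate expert weight = shared base + decompressed delta.
    return base_w + q.astype(np.float32) * scale

# Usage: experts share one base; only their compressed deltas are stored.
base = np.random.randn(8, 16).astype(np.float32)
expert = base + 0.01 * np.random.randn(8, 16).astype(np.float32)
q, s = compress_expert(expert, base)
approx = reconstruct_expert(base, q, s)
print("max reconstruction error:", np.abs(approx - expert).max())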

Cite

Text

Borisov et al. "DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Borisov et al. "DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/borisov2025iclrw-deltamoe/)

BibTeX

@inproceedings{borisov2025iclrw-deltamoe,
  title     = {{DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression}},
  author    = {Borisov, Boyko and Yao, Xiaozhe and Gürel, Nezihe Merve and Klimovic, Ana},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/borisov2025iclrw-deltamoe/}
}