MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Abstract

The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce two sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios. The benchmark is available on GitHub: https://github.com/Auto-CAP/MoE-CAP.

Cite

Text

Jiang et al. "MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems." Advances in Neural Information Processing Systems, 2025.

Markdown

[Jiang et al. "MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/jiang2025neurips-moecap/)

BibTeX

@inproceedings{jiang2025neurips-moecap,
  title     = {{MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems}},
  author    = {Jiang, Yinsicheng and Fu, Yao and Huang, Yeqi and Nie, Ping and Lu, Zhan and Xue, Leyang and He, Congjie and Sit, Man-Kit and Xue, Jilong and Dong, Li and Miao, Ziming and Du, DaYou and Xu, Tairan and Zou, Kai and Ponti, Edoardo and Mai, Luo},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/jiang2025neurips-moecap/}
}