REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

Abstract

Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency, but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert *merging* on discriminative benchmarks, we find that expert *pruning* is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
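As a rough illustration of a criterion that combines router gate-values with expert activation norms, the sketch below scores each expert by the average of (gate value × output norm) over the tokens routed to it, then marks the lowest-scoring experts for pruning. The abstract does not give the exact formula, so the averaging scheme, the function names `reap_style_scores` and `experts_to_prune`, and the toy calibration arrays are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def reap_style_scores(gates, out_norms):
    """Score experts by router-weighted activation norm (illustrative assumption).

    gates:     (tokens, experts) router gate values; 0 where the expert was not selected.
    out_norms: (tokens, experts) L2 norm of each expert's output for each token.
    """
    active = gates > 0                                  # tokens actually routed to each expert
    weighted = np.where(active, gates * out_norms, 0.0) # gate value times activation norm
    counts = active.sum(axis=0)                         # number of routed tokens per expert
    return weighted.sum(axis=0) / np.maximum(counts, 1) # mean over routed tokens only

def experts_to_prune(scores, fraction):
    """Return sorted indices of the lowest-scoring `fraction` of experts."""
    k = int(len(scores) * fraction)
    return np.sort(np.argsort(scores, kind="stable")[:k])

# Hypothetical calibration statistics: 3 tokens, 4 experts, top-2 routing.
gates = np.array([[0.7, 0.3, 0.0, 0.0],
                  [0.0, 0.6, 0.4, 0.0],
                  [0.5, 0.0, 0.0, 0.5]])
out_norms = np.array([[2.0, 1.0, 3.0, 0.2],
                      [1.0, 1.0, 0.5, 0.2],
                      [4.0, 1.0, 1.0, 0.2]])

scores = reap_style_scores(gates, out_norms)
pruned = experts_to_prune(scores, 0.5)
print(scores, pruned)
```

In this toy setup, an expert that is selected often but contributes small outputs can still score low, which is the kind of case a frequency-only criterion would miss.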

Cite

Text

Lasby et al. "REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression." International Conference on Learning Representations, 2026.

Markdown

[Lasby et al. "REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lasby2026iclr-reap/)

BibTeX

@inproceedings{lasby2026iclr-reap,
  title     = {{REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression}},
  author    = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lasby2026iclr-reap/}
}