REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Abstract
Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency, but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert *merging* on discriminative benchmarks, we find that expert *pruning* is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
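As a rough illustration of a criterion that "considers both router gate-values and expert activation norms", the sketch below scores each expert by the average of gate value times output norm over the tokens routed to it, then prunes the lowest-scoring experts. The abstract does not give the formula, so this particular combination (a product averaged over routed tokens) is an assumption, and the function names `reap_style_saliency` and `experts_to_prune` and the toy calibration data are hypothetical.

```python
import math


def reap_style_saliency(gate_values, expert_outputs):
    """Mean of gate * ||output||_2 over the tokens routed to each expert.

    gate_values[t][e]    : router gate value for token t and expert e
                           (0.0 means the token was not routed to e).
    expert_outputs[t][e] : output vector (list of floats) of expert e on token t.
    ASSUMPTION: the product-and-average form is one reading of the abstract.
    """
    num_experts = len(gate_values[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, outputs in zip(gate_values, expert_outputs):
        for e in range(num_experts):
            if gates[e] > 0.0:  # only tokens actually routed to expert e
                norm = math.sqrt(sum(v * v for v in outputs[e]))
                totals[e] += gates[e] * norm
                counts[e] += 1
    # Experts that never fire get a score of 0.0 and are pruned first.
    return [totals[e] / counts[e] if counts[e] else 0.0
            for e in range(num_experts)]


def experts_to_prune(saliency, fraction):
    """Indices of the lowest-saliency experts, for a given pruning fraction."""
    num_prune = int(len(saliency) * fraction)
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e])
    return sorted(ranked[:num_prune])


# Toy calibration batch: 3 tokens, 4 experts, top-2 routing (made-up numbers).
gate_values = [
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.6, 0.0, 0.0, 0.4],
]
expert_outputs = [
    [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.6, 0.8], [6.0, 8.0], [0.0, 0.0]],
    [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 3.0]],
]

saliency = reap_style_saliency(gate_values, expert_outputs)
pruned = experts_to_prune(saliency, fraction=0.5)
print(pruned)  # [1, 3]
```

In this toy batch, expert 1 is routed to as often as expert 0 but contributes little (small gate values and small output norms), so it is pruned ahead of expert 3, which fires only once. A frequency-only criterion would have ranked those two the other way around.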
Cite
Text
Lasby et al. "REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression." International Conference on Learning Representations, 2026.
Markdown
[Lasby et al. "REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lasby2026iclr-reap/)
BibTeX
@inproceedings{lasby2026iclr-reap,
title = {{REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression}},
author = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/lasby2026iclr-reap/}
}