Efficient Hybrid Language Model Compression Through Group-Aware SSM Pruning

Abstract

Hybrid language models that combine Attention and State Space Models (SSMs) have been shown to achieve state-of-the-art accuracy and runtime performance. Recent work has also demonstrated that applying pruning and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. To this end, we introduce a novel group-aware pruning method for Mamba layers that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. We combine this method with FFN, embedding dimension, and layer pruning, along with knowledge distillation-based retraining, to obtain a unified compression recipe for hybrid models. Using this recipe, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to $40\times$ fewer training tokens than similarly-sized models. The resulting model surpasses the accuracy of those models while achieving $\sim2\times$ faster inference throughput, significantly advancing the Pareto frontier.
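To illustrate the core idea of group-aware pruning, the sketch below shows one plausible way to prune SSM heads in whole groups rather than individually, so that heads sharing group-level parameters are always kept or dropped together. This is a minimal, hypothetical illustration, not the paper's actual algorithm: the importance metric, the contiguous head layout, and the function name are all assumptions.

```python
import numpy as np

def group_aware_prune(head_importance, n_groups, keep_ratio):
    """Prune SSM heads in whole groups (hypothetical sketch).

    head_importance: 1-D array of per-head importance scores
        (e.g. mean activation magnitude; the metric is an assumption).
    n_groups: number of SSM groups; heads are assumed to be laid out
        contiguously, with len(head_importance) divisible by n_groups.
    keep_ratio: fraction of groups to keep.
    Returns a boolean per-head mask (True = keep).
    """
    n_heads = len(head_importance)
    heads_per_group = n_heads // n_groups
    # Aggregate importance within each group so a pruning decision
    # never splits a group, preserving its shared structure.
    group_scores = head_importance.reshape(n_groups, heads_per_group).sum(axis=1)
    n_keep = max(1, int(round(n_groups * keep_ratio)))
    # Keep the highest-scoring groups as whole units.
    keep_groups = np.argsort(group_scores)[::-1][:n_keep]
    group_mask = np.zeros(n_groups, dtype=bool)
    group_mask[keep_groups] = True
    # Expand the group-level mask back to a per-head mask.
    return np.repeat(group_mask, heads_per_group)

# Example: 8 heads in 4 groups; keep half the groups.
scores = np.array([0.9, 0.8, 0.1, 0.2, 0.7, 0.6, 0.05, 0.1])
mask = group_aware_prune(scores, n_groups=4, keep_ratio=0.5)
# Groups 0 and 2 score highest, so all of their heads survive together.
```

In practice such a mask would then be applied to the corresponding SSM projection matrices, followed by the distillation-based retraining described in the abstract.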

Cite

Text

Taghibakhshi et al. "Efficient Hybrid Language Model Compression Through Group-Aware SSM Pruning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Taghibakhshi et al. "Efficient Hybrid Language Model Compression Through Group-Aware SSM Pruning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/taghibakhshi2025neurips-efficient/)

BibTeX

@inproceedings{taghibakhshi2025neurips-efficient,
  title     = {{Efficient Hybrid Language Model Compression Through Group-Aware SSM Pruning}},
  author    = {Taghibakhshi, Ali and Sreenivas, Sharath Turuvekere and Muralidharan, Saurav and Chochowski, Marcin and Karnati, Yashaswi and Joshi, Raviraj Bhuminand and Mahabaleshwarkar, Ameya Sunil and Chen, Zijia and Suhara, Yoshi and Olabiyi, Oluwatobi and Korzekwa, Daniel and Patwary, Mostofa and Shoeybi, Mohammad and Kautz, Jan and Catanzaro, Bryan and Aithal, Ashwath and Tajbakhsh, Nima and Molchanov, Pavlo},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/taghibakhshi2025neurips-efficient/}
}