Mamba-PTQ: Outlier Channels in Recurrent Large Language Models

Abstract

Modern recurrent layers are emerging as a promising path toward edge deployment of foundation models, especially in the context of large language models (LLMs). By compressing the whole input sequence into a finite-dimensional representation, recurrent layers can model long-range dependencies while maintaining a constant inference cost per token and a fixed memory requirement. However, the practical deployment of LLMs in resource-limited environments often requires further model compression, such as quantization and pruning. While these techniques are well established for attention-based models, their effects on recurrent layers remain underexplored. In this preliminary work, we focus on post-training quantization for recurrent LLMs and show that Mamba models exhibit the same pattern of outlier channels observed in attention-based LLMs. We show that the difficulty of quantizing SSMs stems from these activation outliers, similar to those observed in transformer-based LLMs. We report baseline results for post-training quantization of Mamba that do not take the activation outliers into account, and we suggest first steps toward outlier-aware quantization.
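
As a rough illustration of the outlier-channel issue described above (a minimal sketch, not code from the paper: the tensor shapes, the outlier magnitude, and the choice of symmetric absmax int8 quantization are all assumptions made here for demonstration), the snippet below shows how a single large-magnitude activation channel inflates the scale of per-tensor quantization and hence the error on every other channel, while per-channel scaling remains largely unaffected.

```python
# Sketch only: synthetic activations with one outlier channel, quantized with
# symmetric absmax int8, comparing per-tensor and per-channel scales.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: 128 tokens x 64 channels, one channel is an outlier.
acts = rng.normal(0.0, 1.0, size=(128, 64)).astype(np.float32)
acts[:, 7] *= 50.0  # assumed outlier channel, far larger than the rest

def quantize_dequantize(x, axis=None):
    """Symmetric absmax int8 fake-quantization; per-tensor if axis is None,
    per-channel if axis indexes the token dimension."""
    scale = np.max(np.abs(x), axis=axis, keepdims=axis is not None) / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

for name, axis in [("per-tensor", None), ("per-channel", 0)]:
    err = np.abs(acts - quantize_dequantize(acts, axis)).mean()
    print(f"{name:11s} mean abs quantization error: {err:.4f}")
```

Running this, the per-tensor error is dominated by the outlier channel's contribution to the shared scale, which is the kind of degradation an outlier-aware quantization scheme would aim to avoid.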

Cite

Text

Pierro and Abreu. "Mamba-PTQ: Outlier Channels in Recurrent Large Language Models." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Pierro and Abreu. "Mamba-PTQ: Outlier Channels in Recurrent Large Language Models." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/pierro2024icmlw-mambaptq/)

BibTeX

@inproceedings{pierro2024icmlw-mambaptq,
  title     = {{Mamba-PTQ: Outlier Channels in Recurrent Large Language Models}},
  author    = {Pierro, Alessandro and Abreu, Steven},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/pierro2024icmlw-mambaptq/}
}