SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Abstract

Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at a group-wise granularity with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
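
To make the two components concrete, below is a minimal NumPy sketch of the general idea: groups with higher aggregate salience receive more bits under a fixed average-bit budget, and each group's quantizer scale is chosen by minimizing a salience-weighted reconstruction error. This is an illustrative sketch, not the authors' implementation; the |w|-based salience proxy, group size of 128, promotion/demotion fractions, and grid-searched clipping scale are all assumptions made for this example (see the paper and repository for the actual method).

import numpy as np

def allocate_bits(salience: np.ndarray, group_size: int = 128,
                  avg_bits: int = 2) -> np.ndarray:
    """Assign a bit-width to each weight group based on its salience.

    The top quarter of groups by salience get (avg_bits + 1) bits, the
    bottom quarter get (avg_bits - 1), and the rest keep avg_bits, so
    the mean bit-width stays at the budget.
    """
    n_groups = salience.size // group_size
    group_salience = salience.reshape(n_groups, group_size).sum(axis=1)
    order = np.argsort(group_salience)        # ascending salience
    bits = np.full(n_groups, avg_bits)
    k = n_groups // 4                          # promoted/demoted fraction (assumed)
    bits[order[-k:]] = avg_bits + 1            # most salient groups: more bits
    bits[order[:k]] = avg_bits - 1             # least salient groups: fewer bits
    return bits

def quantize_group(w: np.ndarray, s: np.ndarray, bits: int) -> np.ndarray:
    """Uniform quantization of one group with a salience-weighted scale search.

    A small grid search over clipping ratios picks the scale minimizing
    salience-weighted error (a stand-in for the paper's Salience-Weighted
    Quantizer Calibration).
    """
    qmax = 2 ** bits - 1
    best_err, best_wq = np.inf, w
    for shrink in np.linspace(0.5, 1.0, 11):   # candidate clipping ratios
        lo, hi = shrink * w.min(), shrink * w.max()
        scale = (hi - lo) / qmax if hi > lo else 1.0
        q = np.clip(np.round((w - lo) / scale), 0, qmax)
        wq = q * scale + lo
        err = np.sum(s * (w - wq) ** 2)        # salience-weighted error
        if err < best_err:
            best_err, best_wq = err, wq
    return best_wq

# Toy usage: quantize a random weight row with a crude |w| salience proxy.
rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
salience = np.abs(w)                           # proxy only (assumed)
bits = allocate_bits(salience)
groups, sal_g = w.reshape(-1, 128), salience.reshape(-1, 128)
w_hat = np.concatenate([quantize_group(g, s, b)
                        for g, s, b in zip(groups, sal_g, bits)])
print("mean bits:", bits.mean(), " weighted MSE:", np.sum(salience * (w - w_hat) ** 2))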

Cite

Text

Huang et al. "SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Huang et al. "SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-slimllm/)

BibTeX

@inproceedings{huang2025icml-slimllm,
  title     = {{SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models}},
  author    = {Huang, Wei and Qin, Haotong and Liu, Yangdong and Li, Yawei and Liu, Qinshuo and Liu, Xianglong and Benini, Luca and Magno, Michele and Zhang, Shiming and Qi, Xiaojuan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {25672--25692},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/huang2025icml-slimllm/}
}