SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Abstract
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at group-wise granularity with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
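To make the two components concrete, below is a minimal, hypothetical PyTorch sketch. The function names (`allocate_group_bits`, `calibrate_scale_zero`), the AWQ-style salience proxy (|w| scaled by per-channel activation norms), and the quartile-based ±1-bit rule are illustrative assumptions for exposition, not the paper's exact formulation; SliM-LLM derives salience from calibration data.

```python
import torch

def allocate_group_bits(weight: torch.Tensor, act_norm: torch.Tensor,
                        group_size: int = 128, target_bits: int = 2) -> torch.Tensor:
    """Salience-Determined Bit Allocation (toy version).

    Salience proxy (assumption): |w| scaled by the per-input-channel
    activation norm. Groups in the top salience quartile get one extra
    bit, the bottom quartile one fewer, so the mean stays at target_bits
    and the memory budget matches uniform group-wise quantization.
    """
    out_dim, in_dim = weight.shape
    n_groups = in_dim // group_size
    salience = (weight.abs() * act_norm).view(out_dim, n_groups, group_size)
    group_salience = salience.mean(dim=(0, 2))        # one score per group
    order = group_salience.argsort()                  # ascending salience
    bits = torch.full((n_groups,), target_bits)
    k = n_groups // 4
    bits[order[:k]] = target_bits - 1                 # least salient groups
    bits[order[-k:]] = target_bits + 1                # most salient groups
    return bits

def calibrate_scale_zero(w_group: torch.Tensor, salience: torch.Tensor,
                         bits: int, n_grid: int = 80):
    """Salience-Weighted Quantizer Calibration (toy version).

    Grid-searches an asymmetric (scale, zero-point) pair that minimizes
    salience-weighted reconstruction error instead of plain MSE, so
    high-salience elements dominate the fit.
    """
    levels = 2 ** bits - 1
    w_min, w_max = w_group.min(), w_group.max()
    best, best_err = None, float("inf")
    for shrink in torch.linspace(0.5, 1.0, n_grid):
        scale = ((w_max - w_min) * shrink / levels).clamp(min=1e-12)
        zero = (-w_min * shrink / scale).round()
        q = ((w_group / scale).round() + zero).clamp(0, levels)
        err = (salience * (w_group - (q - zero) * scale) ** 2).sum().item()
        if err < best_err:
            best, best_err = (scale, zero), err
    return best
```

For a 4096x4096 projection with 128-element groups, this yields 32 per-group bit-widths whose mean equals the 2-bit target, so the structured layout remains as hardware-friendly as uniform group-wise quantization.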
Cite

Text:
Huang et al. "SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown:
[Huang et al. "SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-slimllm/)

BibTeX:
@inproceedings{huang2025icml-slimllm,
title = {{SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models}},
author = {Huang, Wei and Qin, Haotong and Liu, Yangdong and Li, Yawei and Liu, Qinshuo and Liu, Xianglong and Benini, Luca and Magno, Michele and Zhang, Shiming and Qi, Xiaojuan},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {25672--25692},
volume = {267},
url = {https://mlanthology.org/icml/2025/huang2025icml-slimllm/}
}