ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
Abstract
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost of inference. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that advances the state of the art. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variance is highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform- and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on WikiText than the next best method, SpinQuant, and up to a 3× speedup over the 16-bit baseline. Anonymous code repository available at https://anonymous.4open.science/r/project-resq-2142.
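To make the projection step described in the abstract concrete, the following Python sketch builds a mixed-precision projection from calibration activations: PCA identifies the highest-variance directions, those coefficients are earmarked for 8-bit, the residual subspace for 4-bit, and a random orthogonal rotation is applied within each subspace to suppress outliers. Function names, shapes, and the keep ratio are illustrative assumptions, not the authors' implementation.

```python
import torch

def resq_projection_sketch(acts: torch.Tensor, keep_ratio: float = 1 / 8):
    """Minimal sketch of the subspace split, assuming acts is a
    (num_tokens, hidden_dim) matrix of calibration activations."""
    hidden_dim = acts.shape[1]
    # PCA via eigendecomposition of the activation covariance
    cov = acts.T @ acts / acts.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(cov)      # eigenvalues in ascending order
    r = int(hidden_dim * keep_ratio)               # e.g. 1/8 of the hidden dimension
    hi_basis = eigvecs[:, -r:]                     # top-variance directions -> high precision (8-bit)
    lo_basis = eigvecs[:, :-r]                     # residual directions     -> low precision (4-bit)
    # Random orthogonal rotation within each subspace to further flatten outliers;
    # rotating coefficients inside a subspace does not mix the two precision groups.
    q_hi, _ = torch.linalg.qr(torch.randn(r, r))
    q_lo, _ = torch.linalg.qr(torch.randn(hidden_dim - r, hidden_dim - r))
    proj = torch.cat([lo_basis @ q_lo, hi_basis @ q_hi], dim=1)
    # Columns [:hidden_dim - r] carry the 4-bit coefficients, [-r:] the 8-bit ones.
    return proj, r
```

In use, one would project activations (and fold the inverse projection into adjacent weights) so that only the last r coefficients need higher-precision storage and arithmetic; the exact fusion into the transformer layers follows the paper, not this sketch.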
Cite
Text
Saxena et al. "ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Saxena et al. "ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/saxena2025icml-resq/)
BibTeX
@inproceedings{saxena2025icml-resq,
  title     = {{ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals}},
  author    = {Saxena, Utkarsh and Sharify, Sayeh and Roy, Kaushik and Wang, Xin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {53095--53114},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/saxena2025icml-resq/}
}