ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Abstract

Quantizing the weights, activations, and KV cache of large language models to 4-bit without degrading generalizability is challenging due to outlier-induced activation quantization errors. We propose ResQ, a post-training quantization (PTQ) method that uses principal component analysis to identify a low-rank subspace (in practice 1/8 of the hidden dimension), keeping coefficients within this subspace in 8-bit precision while quantizing the rest in 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. ResQ outperforms recent PTQ methods on Llama and Qwen2.5 models, achieving up to 33% lower WikiText perplexity than SpinQuant and up to a 3x speedup over 16-bit inference.
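The abstract outlines the core mechanism: project onto a PCA basis, keep the top-rank coefficients at 8-bit and the residual at 4-bit, and rotate each subspace to spread outliers. The following is a minimal sketch of that idea, not the authors' implementation; all function and parameter names (`resq_style_quantize`, `rank_fraction`, the fake-quantizer, the calibration shapes) are illustrative assumptions.

```python
# Minimal sketch of a ResQ-style mixed-precision projection (assumed, not official code):
# estimate a PCA basis from calibration activations, keep the top-r coefficients in 8-bit,
# quantize the residual coefficients in 4-bit, and randomly rotate each subspace.
import torch

def symmetric_quantize(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Per-tensor symmetric fake-quantization to n_bits (round-to-nearest)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def random_rotation(dim: int) -> torch.Tensor:
    """Random orthogonal matrix (QR of a Gaussian), used to spread outliers."""
    q, _ = torch.linalg.qr(torch.randn(dim, dim))
    return q

def resq_style_quantize(x: torch.Tensor, calib: torch.Tensor, rank_fraction: float = 0.125):
    """
    x:     activations to quantize, shape (tokens, hidden)
    calib: calibration activations used to estimate the PCA basis
    """
    hidden = x.shape[-1]
    r = int(hidden * rank_fraction)          # e.g. 1/8 of the hidden dimension

    # PCA basis from calibration data (eigenvectors of the covariance).
    cov = calib.T @ calib / calib.shape[0]
    _, eigvecs = torch.linalg.eigh(cov)      # eigenvalues in ascending order
    basis = eigvecs.flip(-1)                 # descending: top directions first

    # Split into high-precision (top-r) and low-precision (residual) subspaces,
    # then rotate each subspace with a random orthogonal matrix.
    U_hi = basis[:, :r] @ random_rotation(r)
    U_lo = basis[:, r:] @ random_rotation(hidden - r)

    # Project, quantize each part at its own precision, and reconstruct.
    x_hi = symmetric_quantize(x @ U_hi, n_bits=8)
    x_lo = symmetric_quantize(x @ U_lo, n_bits=4)
    return x_hi @ U_hi.T + x_lo @ U_lo.T

# Example: reconstruction error on random data.
calib = torch.randn(512, 256)
x = torch.randn(64, 256)
x_q = resq_style_quantize(x, calib)
print("relative error:", ((x_q - x).norm() / x.norm()).item())
```

Because the PCA basis is orthonormal and the rotations are orthogonal, the projection is lossless before quantization; only the 4-bit residual subspace and the 8-bit low-rank subspace contribute quantization error.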

Cite

Text

Saxena et al. "ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals." ICLR 2025 Workshops: SCOPE, 2025.

Markdown

[Saxena et al. "ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals." ICLR 2025 Workshops: SCOPE, 2025.](https://mlanthology.org/iclrw/2025/saxena2025iclrw-resq/)

BibTeX

@inproceedings{saxena2025iclrw-resq,
  title     = {{ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals}},
  author    = {Saxena, Utkarsh and Sharify, Sayeh and Roy, Kaushik and Wang, Xin},
  booktitle = {ICLR 2025 Workshops: SCOPE},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/saxena2025iclrw-resq/}
}