NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

Abstract

Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on WikiText-2. This represents a more than 55% reduction in the perplexity gap with respect to the unquantized model (perplexity 6.14) compared to the state-of-the-art Meta’s SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on larger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
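To make the idea of self-similar nested lattice quantization on the Gosset lattice (E8) concrete, here is a minimal Python sketch. It uses the standard Conway–Sloane nearest-point rule for D8/E8 and a simple mod-qΛ reduction; the function names, the nesting parameter `q`, and the plain coset-residual output are illustrative assumptions, not the paper's actual encoder (which additionally handles scaling, overload, and efficient coset indexing).

```python
import numpy as np

def nearest_Dn(x):
    """Nearest point of the checkerboard lattice D_n (integer vectors with even sum)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Sum is odd: re-round the coordinate with the largest rounding error the other way.
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def nearest_E8(x):
    """Nearest point of the Gosset lattice E8 = D8 ∪ (D8 + 1/2·1)."""
    c0 = nearest_Dn(x)
    c1 = nearest_Dn(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

def nested_quantize(x, q):
    """Illustrative self-similar nested lattice quantizer (assumed interface):
    round x to the fine lattice E8, then reduce modulo the coarse lattice q*E8.
    Returns the coset residual (what gets coded) and the coarse lattice point."""
    fine = nearest_E8(np.asarray(x, dtype=float))
    coarse = q * nearest_E8(fine / q)   # Q_{qΛ}(y) = q · Q_Λ(y / q)
    return fine - coarse, coarse

# Example: quantize one 8-dimensional block.
x = np.random.randn(8)
residual, coarse = nested_quantize(x, q=4)
```

In practice, weight and activation vectors would be split into 8-dimensional blocks and scaled before encoding; the residual indexes one of q^8 cosets of qE8 in E8, i.e. 8·log2(q) bits per block.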

Cite

Text

Savkin et al. "NestQuant: Nested Lattice Quantization for Matrix Products and LLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Savkin et al. "NestQuant: Nested Lattice Quantization for Matrix Products and LLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/savkin2025icml-nestquant/)

BibTeX

@inproceedings{savkin2025icml-nestquant,
  title     = {{NestQuant: Nested Lattice Quantization for Matrix Products and LLMs}},
  author    = {Savkin, Semyon and Porat, Eitan and Ordentlich, Or and Polyanskiy, Yury},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {53042--53062},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/savkin2025icml-nestquant/}
}