QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

Abstract

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP’s (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimensional unit-ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
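The two core ingredients named in the abstract can be illustrated in a few lines of NumPy. The snippet below is a minimal sketch, not the QuIP# implementation: it builds a dense randomized Hadamard transform (the released code uses a fast $O(n \log n)$ transform and handles non-power-of-two dimensions), and it finds the nearest point of the $E_8$ lattice by decoding its two $D_8$ cosets, the standard Conway–Sloane approach. Function names such as `incoherence_process` and `nearest_e8`, and the use of SciPy, are our assumptions for illustration, not names from the repository.

```python
import numpy as np
from scipy.linalg import hadamard


def rht_matrix(n, rng):
    """Randomized Hadamard transform: (1/sqrt(n)) * H_n * diag(random signs).

    Assumes n is a power of two (scipy's `hadamard` requires it). The result
    is orthogonal, so the transform can be undone exactly after quantization.
    """
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) / np.sqrt(n) * signs  # column-scales H, i.e. H @ diag(signs)


def incoherence_process(W, seed=0):
    """Two-sided RHT of a weight matrix: W_hat = U W V^T with U, V orthogonal."""
    m, n = W.shape
    rng = np.random.default_rng(seed)
    U, V = rht_matrix(m, rng), rht_matrix(n, rng)
    return U @ W @ V.T, U, V  # U.T @ W_hat @ V recovers W exactly


def _nearest_d8(x):
    """Nearest point of D8 = {z in Z^8 : sum(z) even} (Conway & Sloane decoder)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding
        # error to its next-nearest integer.
        k = int(np.argmax(np.abs(x - f)))
        f[k] += 1.0 if x[k] >= f[k] else -1.0
    return f


def nearest_e8(x):
    """Nearest point of E8 = D8 ∪ (D8 + 1/2): decode both cosets, keep the closer."""
    c0 = _nearest_d8(x)
    c1 = _nearest_d8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.standard_normal((128, 256))
    W_hat, U, V = incoherence_process(W)
    # Quantize the incoherent weights 8 entries at a time against the E8 lattice
    # (the actual codebooks also scale and prune lattice points to hit a bit budget).
    Q = np.apply_along_axis(nearest_e8, 1, W_hat.reshape(-1, 8)).reshape(W_hat.shape)
    W_rec = U.T @ Q @ V  # undo the incoherence processing
    print("relative error:", np.linalg.norm(W - W_rec) / np.linalg.norm(W))
```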

Cite

Text

Tseng et al. "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." International Conference on Machine Learning, 2024.

Markdown

[Tseng et al. "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/tseng2024icml-quip/)

BibTeX

@inproceedings{tseng2024icml-quip,
  title     = {{QuIP\#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks}},
  author    = {Tseng, Albert and Chee, Jerry and Sun, Qingyao and Kuleshov, Volodymyr and De Sa, Christopher},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {48630--48656},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/tseng2024icml-quip/}
}