Matryoshka Quantization

Abstract

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant's co-training and co-distillation regularization, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively.
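
To illustrate the nesting insight from the abstract, below is a minimal sketch (not the authors' implementation) of how lower bit-width codes can be read off the most significant bits of an int8-quantized weight tensor. The function name `slice_msb` and the use of NumPy's arithmetic right shift on signed integers are assumptions for illustration only.

```python
import numpy as np

def slice_msb(q_int8: np.ndarray, target_bits: int) -> np.ndarray:
    """Keep only the `target_bits` most significant bits of signed int8 codes.

    Sketch only: relies on NumPy's sign-preserving (arithmetic) right shift,
    so the result is a signed value in [-2**(target_bits-1), 2**(target_bits-1)).
    """
    assert q_int8.dtype == np.int8 and 2 <= target_bits <= 8
    shift = 8 - target_bits
    return (q_int8 >> shift).astype(np.int8)

# Example: a single int8 weight tensor yields nested int4 and int2 views.
rng = np.random.default_rng(0)
w_int8 = rng.integers(-128, 128, size=(4, 4), dtype=np.int8)
w_int4 = slice_msb(w_int8, 4)  # 4-bit codes nested in the top 4 bits
w_int2 = slice_msb(w_int8, 2)  # 2-bit codes nested in the top 2 bits
# To dequantize a sliced view, the per-channel scale would be multiplied by
# 2**shift (or the codes shifted back up) -- omitted here for brevity.
```

This only shows how the nested integer structure is extracted; the paper's contribution is the multi-scale co-training and co-distillation that makes all extracted precisions accurate from one set of weights.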

Cite

Text

Nair et al. "Matryoshka Quantization." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Nair et al. "Matryoshka Quantization." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/nair2025iclrw-matryoshka/)

BibTeX

@inproceedings{nair2025iclrw-matryoshka,
  title     = {{Matryoshka Quantization}},
  author    = {Nair, Pranav Ajit and Datta, Puranjay and Dean, Jeff and Jain, Prateek and Kusupati, Aditya},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/nair2025iclrw-matryoshka/}
}