Matryoshka Quantization
Abstract
Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it at the precision demanded by the deployment. Furthermore, leveraging MatQuant's co-training and co-distillation regularization, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively.
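To make the nested-integer observation concrete, below is a minimal NumPy sketch, not the paper's implementation: the function name, the unsigned-code convention, and the omission of scale/zero-point handling are assumptions for illustration. It shows how an int4 or int2 code can be read directly from the most significant bits of an int8 code.

import numpy as np

def slice_msbs(q_int8_codes, target_bits):
    # Keep only the `target_bits` most significant bits of an unsigned
    # 8-bit quantization code, yielding the nested lower-precision code.
    # q_int8_codes: uint8 array of codes in [0, 255].
    # Returns codes in [0, 2**target_bits - 1].
    shift = 8 - target_bits
    return q_int8_codes >> shift

# An int8 code and the int4 / int2 codes nested in its top bits.
q8 = np.array([0b10110110], dtype=np.uint8)  # decimal 182
q4 = slice_msbs(q8, 4)                       # 0b1011 -> 11
q2 = slice_msbs(q8, 2)                       # 0b10   -> 2
print(q8, q4, q2)                            # [182] [11] [2]

Under this view, serving at a lower precision only requires storing the single int8 model and shifting away the unneeded least significant bits; the role of MatQuant's co-training is to keep each of these sliced models accurate.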
Cite
Text
Nair et al. "Matryoshka Quantization." ICLR 2025 Workshops: SLLM, 2025.
Markdown
[Nair et al. "Matryoshka Quantization." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/nair2025iclrw-matryoshka/)
BibTeX
@inproceedings{nair2025iclrw-matryoshka,
  title     = {{Matryoshka Quantization}},
  author    = {Nair, Pranav Ajit and Datta, Puranjay and Dean, Jeff and Jain, Prateek and Kusupati, Aditya},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/nair2025iclrw-matryoshka/}
}