KurTail: Kurtosis-Based LLM Quantization

Abstract

One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages a kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache to 4 bits. KurTail relies on layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant, with a 2.6% MMLU gain and a 2.9% reduction in perplexity, all while lowering the cost of training the rotation. For comparison, learning the rotation with SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.

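The rotation idea can be illustrated with a small, self-contained sketch (not the authors' code): it learns an orthogonal matrix that minimizes the average per-channel kurtosis of rotated activations, so outlier energy is spread across channels before uniform quantization. The toy data, the skew-symmetric parameterization of the rotation, and the optimizer settings below are all illustrative assumptions.

import torch

torch.manual_seed(0)

d, n = 64, 1024                              # toy hidden size and number of tokens
x = torch.randn(n, d)                        # assumed toy activations
x[torch.randint(0, n, (16,)), :4] += 40.0    # rare, channel-concentrated outliers

def kurtosis(z):
    # Average per-channel kurtosis: the "tailedness" measure being minimized.
    z = z - z.mean(dim=0, keepdim=True)
    var = z.var(dim=0, unbiased=False) + 1e-8
    return ((z ** 4).mean(dim=0) / var ** 2).mean()

# Keep the rotation orthogonal by taking the matrix exponential of a
# skew-symmetric matrix (one possible parameterization, assumed here).
a = torch.zeros(d, d, requires_grad=True)
opt = torch.optim.Adam([a], lr=1e-2)

for _ in range(200):
    q = torch.linalg.matrix_exp(a - a.T)     # orthogonal rotation Q
    loss = kurtosis(x @ q)                   # tailedness of rotated activations
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    q = torch.linalg.matrix_exp(a - a.T)
    print(f"avg kurtosis before rotation: {kurtosis(x).item():.1f}")
    print(f"avg kurtosis after rotation:  {kurtosis(x @ q).item():.1f}")

In rotation-based PTQ schemes such as QuaRot and SpinQuant, an orthogonal Q of this kind is typically folded into neighboring weight matrices at inference time, so reducing activation kurtosis adds no runtime cost; the 200-step Adam loop above is only a stand-in for whatever optimizer and schedule the paper actually uses.
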
Cite

Text

Akhondzadeh et al. "KurTail: Kurtosis-Based LLM Quantization." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Akhondzadeh et al. "KurTail: Kurtosis-Based LLM Quantization." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/akhondzadeh2025iclrw-kurtail/)

BibTeX

@inproceedings{akhondzadeh2025iclrw-kurtail,
  title     = {{KurTail: Kurtosis-Based LLM Quantization}},
  author    = {Akhondzadeh, Mohammad Sadegh and Bojchevski, Aleksandar and Eleftheriou, Evangelos and Dazzi, Martino},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/akhondzadeh2025iclrw-kurtail/}
}