CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization
Abstract
Quantization has emerged as a key technique for compressing large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses greedy coordinate descent to minimize the layer-wise reconstruction loss, yielding high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. We perform extensive evaluation on the Gemma and PaLM2 model families, and demonstrate that CDQuant consistently outperforms GPTQ in 2-4 bit weight quantization. Moreover, CDQuant improves the performance of state-of-the-art PTQ techniques such as QuIP and FrameQuant when used as a replacement for their GPTQ component, resulting in further gains in quality.
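To make the abstract's central idea concrete, the sketch below shows one plausible instantiation of greedy coordinate descent on the layer-wise reconstruction objective ||X(w - q)||^2 = (w - q)^T H (w - q), with H = X^T X built from calibration activations. The function names (`greedy_cd_quantize`, `quantize_grid`), the per-channel formulation, and the stopping rule are illustrative assumptions, not the paper's exact CDQuant implementation.

```python
import numpy as np

def quantize_grid(v, grid):
    """Round each entry of v to the nearest point on the quantization grid."""
    idx = np.argmin(np.abs(grid[None, :] - v[:, None]), axis=1)
    return grid[idx]

def greedy_cd_quantize(w, H, grid, num_iters=None):
    """Illustrative greedy coordinate descent for one output channel.

    Minimizes (w - q)^T H (w - q) over q restricted to `grid`, where
    H = X^T X comes from calibration data. This is a sketch of the
    general technique, not the authors' reference implementation.
    """
    grid = np.asarray(grid, dtype=w.dtype)
    d = w.shape[0]
    num_iters = num_iters if num_iters is not None else d
    q = quantize_grid(w, grid)            # start from round-to-nearest
    r = w - q                              # current residual
    Hr = H @ r
    diag = np.maximum(np.diag(H), 1e-12)
    for _ in range(num_iters):
        # Per-coordinate optimal update, then projected onto the grid.
        cand = quantize_grid(q + Hr / diag, grid)
        delta = cand - q
        # Exact loss decrease if only coordinate i is changed by delta[i]:
        # 2*delta*(Hr)_i - delta^2 * H_ii.
        gain = 2.0 * delta * Hr - delta**2 * diag
        i = int(np.argmax(gain))
        if gain[i] <= 0:
            break                          # no single-coordinate move helps
        q[i] = cand[i]
        r[i] -= delta[i]
        Hr -= delta[i] * H[:, i]           # keep H @ r consistent
    return q
```

Each iteration greedily picks the coordinate whose grid-constrained update gives the largest exact decrease in the reconstruction loss; maintaining `Hr` incrementally keeps the per-step cost linear in the layer width.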
Cite
Text
Nair and Suggala. "CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization." NeurIPS 2024 Workshops: Compression, 2024.
Markdown
[Nair and Suggala. "CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization." NeurIPS 2024 Workshops: Compression, 2024.](https://mlanthology.org/neuripsw/2024/nair2024neuripsw-cdquant/)
BibTeX
@inproceedings{nair2024neuripsw-cdquant,
title = {{CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization}},
author = {Nair, Pranav Ajit and Suggala, Arun},
booktitle = {NeurIPS 2024 Workshops: Compression},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/nair2024neuripsw-cdquant/}
}