GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Abstract
Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup-table-based mpGEMM. GANQ achieves superior quantization performance by using a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization error. Extensive experiments demonstrate that GANQ narrows the perplexity gap to the FP16 baseline relative to state-of-the-art methods under both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57× speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
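The core idea in the abstract, non-uniform quantization served through a lookup table, can be illustrated with a minimal NumPy sketch. This is not GANQ's optimization algorithm (which fits the codebook per layer on the GPU to minimize layer-wise error); here the codebook is simply placed at quantiles of the weight distribution, a hypothetical stand-in, and dequantization is a table gather as in LUT-based mpGEMM.

```python
import numpy as np

def nonuniform_quantize(w, bits=4):
    """Quantize a weight vector to 2**bits non-uniform levels.

    Illustration only: the codebook is taken at quantiles of the
    empirical weight distribution, so levels concentrate where
    weights are dense (GANQ instead optimizes the codebook).
    """
    levels = 2 ** bits
    codebook = np.quantile(w, np.linspace(0.0, 1.0, levels))
    # Each weight is stored as the index of its nearest codebook entry.
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.astype(np.uint8)

def lut_dequantize(codebook, idx):
    """Dequantization is a pure lookup-table gather."""
    return codebook[idx]

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
codebook, idx = nonuniform_quantize(w, bits=4)
w_hat = lut_dequantize(codebook, idx)
```

At inference time only the 4-bit indices and the small codebook are stored; a LUT-based mpGEMM kernel gathers codebook entries on the fly instead of running a multiply-based dequantization pass.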
Cite
Text
Zhao and Yuan. "GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Zhao and Yuan. "GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhao2025icml-ganq/)

BibTeX
@inproceedings{zhao2025icml-ganq,
title = {{GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models}},
author = {Zhao, Pengxiang and Yuan, Xiaoming},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {77911--77925},
volume = {267},
url = {https://mlanthology.org/icml/2025/zhao2025icml-ganq/}
}