CBQ: Cross-Block Quantization for Large Language Models

Abstract

Post-training quantization (PTQ) has played a pivotal role in compressing large language models (LLMs) at ultra-low cost. Although current PTQ methods have achieved promising results by addressing outliers and employing layer- or block-wise loss optimization techniques, they still suffer from significant performance degradation at ultra-low bit precision. To dissect this issue, we conducted an in-depth analysis of quantization errors specific to LLMs and surprisingly discovered that, unlike traditional sources of quantization error, the growing number of model parameters, combined with the reduction in quantization bits, intensifies inter-layer and intra-layer dependencies, which severely impact quantization accuracy. This finding highlights a critical challenge in quantizing LLMs. To address it, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ leverages a cross-block dependency scheme to establish long-range dependencies across multiple blocks and integrates an adaptive LoRA-Rounding technique to manage intra-layer dependencies. To further enhance performance, CBQ incorporates a coarse-to-fine pre-processing mechanism for weights and activations. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ takes only 4.3 hours for weight-only 4-bit quantization of LLAMA1-65B, achieving a commendable trade-off between performance and efficiency.
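
To make the core idea concrete, below is a minimal, hedged sketch of cross-block reconstruction as described in the abstract: a sliding window of consecutive transformer blocks is optimized jointly so that the quantized blocks reproduce the full-precision outputs, capturing dependencies that single-block methods miss. This is not the authors' implementation; the helper names (`fake_quant`, `QuantLinear`, `cross_block_reconstruction`), the window size, optimizer choice, and the simple additive rounding offset (standing in for CBQ's adaptive LoRA-Rounding and coarse-to-fine pre-processing) are all illustrative assumptions.

```python
# Illustrative sketch of cross-block reconstruction for PTQ (not the authors' code).
import torch
import torch.nn as nn


def fake_quant(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator so gradients reach learnable rounding offsets.
    return w + (w_q - w).detach()


class QuantLinear(nn.Module):
    """Linear layer with a learnable additive rounding offset, a simplified
    stand-in for the adaptive LoRA-Rounding mentioned in the abstract."""

    def __init__(self, linear: nn.Linear, n_bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.bias = (nn.Parameter(linear.bias.detach(), requires_grad=False)
                     if linear.bias is not None else None)
        self.offset = nn.Parameter(torch.zeros_like(self.weight))  # learned correction
        self.n_bits = n_bits

    def forward(self, x):
        w_q = fake_quant(self.weight + self.offset, self.n_bits)
        return nn.functional.linear(x, w_q, self.bias)


def cross_block_reconstruction(fp_blocks, q_blocks, calib_inputs,
                               window: int = 2, steps: int = 200, lr: float = 1e-3):
    """Slide a window over consecutive blocks and jointly tune the quantization
    parameters of all blocks in the window against full-precision outputs."""
    x_fp, x_q = calib_inputs, calib_inputs.clone()
    for start in range(0, len(fp_blocks), window):
        fp_group = fp_blocks[start:start + window]
        q_group = q_blocks[start:start + window]
        params = [p for blk in q_group for p in blk.parameters() if p.requires_grad]
        opt = torch.optim.Adam(params, lr=lr)
        # Full-precision reference for the whole window (cross-block target).
        with torch.no_grad():
            y_fp = x_fp
            for blk in fp_group:
                y_fp = blk(y_fp)
        for _ in range(steps):
            y_q = x_q
            for blk in q_group:
                y_q = blk(y_q)
            loss = nn.functional.mse_loss(y_q, y_fp)  # cross-block reconstruction loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Propagate activations to the next window.
        with torch.no_grad():
            x_fp, x_q = y_fp, y_q.detach()
    return q_blocks
```

The window of jointly optimized blocks is what distinguishes this from layer- or block-wise reconstruction: the loss is taken after several blocks, so the tuned rounding offsets in one block can compensate for quantization error introduced in its neighbors.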

Cite

Text

Ding et al. "CBQ: Cross-Block Quantization for Large Language Models." International Conference on Learning Representations, 2025.

Markdown

[Ding et al. "CBQ: Cross-Block Quantization for Large Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/ding2025iclr-cbq/)

BibTeX

@inproceedings{ding2025iclr-cbq,
  title     = {{CBQ: Cross-Block Quantization for Large Language Models}},
  author    = {Ding, Xin and Liu, Xiaoyu and Tu, Zhijun and Zhang, Yun and Li, Wei and Hu, Jie and Chen, Hanting and Tang, Yehui and Xiong, Zhiwei and Yin, Baoqun and Wang, Yunhe},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/ding2025iclr-cbq/}
}