BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization

ICML 2024, pp. 62978-62992

Abstract

Large Language Models (LLMs) now commonly have billions of parameters, which poses significant challenges for hardware platforms. Although quantization is an efficient way to reduce the computation and memory overhead of inference, we stress that mainstream low-bit quantization approaches still suffer either from outliers in the data distribution or from a lack of hardware efficiency. We also find that low-bit data formats have untapped expressiveness for covering the atypical distributions of language data. In this paper, we propose a novel numerical representation, Bi-Exponent Block Floating Point (BiE), together with a new quantization flow. BiE quantization shows superior accuracy and hardware friendliness across various models and benchmarks.
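The abstract does not spell out the format itself, so the following is only a minimal NumPy sketch of the general idea behind block floating point with two shared exponents per block: outliers get their own shared exponent so they no longer force the rest of the block to lose precision. The function names (`bfp_quantize`, `bi_exponent_quantize`), the 4-bit mantissa, and the magnitude-based outlier split are illustrative assumptions, not the authors' algorithm.

```python
# Hypothetical sketch (not the paper's implementation): contrasts plain
# block floating point (one shared exponent per block) with a bi-exponent
# variant that gives the block's largest-magnitude values a second shared
# exponent, so outliers do not push small values toward underflow.
import numpy as np

def bfp_quantize(block, mantissa_bits=4):
    """Plain BFP: one shared exponent for the whole block."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block)
    # Shared exponent chosen so the largest value fits in the mantissa range.
    shared_exp = np.floor(np.log2(max_abs)) - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    q = np.clip(np.round(block / scale),
                -(2 ** mantissa_bits), 2 ** mantissa_bits - 1)
    return q * scale

def bi_exponent_quantize(block, mantissa_bits=4, outlier_frac=0.125):
    """Bi-exponent sketch: the outlier sub-group gets its own shared exponent."""
    k = max(1, int(len(block) * outlier_frac))
    order = np.argsort(np.abs(block))
    normal_idx, outlier_idx = order[:-k], order[-k:]
    out = np.empty_like(block)
    out[normal_idx] = bfp_quantize(block[normal_idx], mantissa_bits)
    out[outlier_idx] = bfp_quantize(block[outlier_idx], mantissa_bits)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(size=32)
    block[3] *= 40.0  # inject an activation-style outlier
    for name, fn in [("single-exponent BFP", bfp_quantize),
                     ("bi-exponent BFP", bi_exponent_quantize)]:
        err = np.mean((block - fn(block)) ** 2)
        print(f"{name:>20s}  MSE = {err:.6f}")
```

Running the toy example shows the bi-exponent variant achieving a lower reconstruction error on the outlier-contaminated block, since the non-outlier values are quantized against a much smaller shared scale.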

Cite

Text

Zou et al. "BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization." International Conference on Machine Learning, 2024.

Markdown

[Zou et al. "BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/zou2024icml-bie/)

BibTeX

@inproceedings{zou2024icml-bie,
  title     = {{BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization}},
  author    = {Zou, Lancheng and Zhao, Wenqian and Yin, Shuo and Bai, Chen and Sun, Qi and Yu, Bei},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {62978--62992},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/zou2024icml-bie/}
}