Block and Subword-Scaling Floating-Point (BSFP): An Efficient Non-Uniform Quantization for Low Precision Inference

Abstract

In this paper, we propose Block and Subword-Scaling Floating-Point (BSFP), a non-uniform quantization scheme for the skewed and non-uniform distributions of weight vectors in neural networks. By quantizing each weight vector as the superposition of multiple subword vectors (in two's complement) with scaling factors (in Low-bit Floating-Point, LBFP), BSFP can effectively fit the distribution of weight vectors while maintaining high computation efficiency. Furthermore, we present a grid search-based MSE-optimal quantization flow and a scaled serial processing engine to complete the quantization pipeline and the hardware infrastructure. Experimental results on the ImageNet classification task show that our proposed method outperforms the state-of-the-art Microsoft Floating Point (MSFP) by up to 20.56% top-1 accuracy at the same weight precision while reducing model size by up to 10.3%. Furthermore, BSFP delivers up to 2.0$\times$ the computing throughput and up to 5.3$\times$ the energy efficiency of MSFP under the same silicon area budget.
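To make the scheme concrete, the following is a minimal sketch of the core idea described above: approximating a weight vector as a superposition of scaled low-bit two's-complement subword vectors, with each scaling factor chosen by grid search to minimize MSE on the remaining residual. This is an illustrative reconstruction, not the authors' implementation; in particular, the paper stores scales in Low-bit Floating-Point (LBFP), whereas this sketch keeps them in full precision, and the function and parameter names (`bsfp_quantize`, `num_subwords`, `grid_size`) are hypothetical.

```python
import numpy as np

def quantize_subword(residual, scale, bits):
    """Round residual/scale into the signed two's-complement range for `bits`."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    return np.clip(np.round(residual / scale), qmin, qmax)

def bsfp_quantize(weights, num_subwords=2, bits=4, grid_size=64):
    """Greedily fit `weights` as a sum of scaled low-bit subword vectors.

    Each round grid-searches a scaling factor that minimizes the MSE of
    quantizing the current residual, then subtracts the fitted term.
    """
    weights = np.asarray(weights, dtype=np.float64)
    residual = weights.copy()
    approx = np.zeros_like(weights)
    subwords, scales = [], []
    qmax = 2 ** (bits - 1) - 1
    for _ in range(num_subwords):
        max_abs = np.max(np.abs(residual))
        if max_abs == 0.0:
            break  # residual already exactly represented
        # Candidate scales on a grid below the max-magnitude-aligned scale.
        grid = (max_abs / qmax) * np.linspace(0.25, 1.0, grid_size)
        best = None
        for s in grid:
            q = quantize_subword(residual, s, bits)
            mse = np.mean((residual - s * q) ** 2)
            if best is None or mse < best[0]:
                best = (mse, s, q)
        _, s, q = best
        subwords.append(q.astype(np.int32))
        scales.append(s)
        approx = approx + s * q
        residual = weights - approx
    return subwords, scales, approx
```

Because each subword fits the residual left by the previous one, adding subwords monotonically reduces (or preserves) the reconstruction error, which is what lets the superposition track skewed, non-uniform weight distributions.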

Cite

Text

Lo et al. "Block and Subword-Scaling Floating-Point (BSFP): An Efficient Non-Uniform Quantization for Low Precision Inference." International Conference on Learning Representations, 2023.

Markdown

[Lo et al. "Block and Subword-Scaling Floating-Point (BSFP): An Efficient Non-Uniform Quantization for Low Precision Inference." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/lo2023iclr-block/)

BibTeX

@inproceedings{lo2023iclr-block,
  title     = {{Block and Subword-Scaling Floating-Point (BSFP): An Efficient Non-Uniform Quantization for Low Precision Inference}},
  author    = {Lo, Yun-Chen and Lee, Tse-Kuang and Liu, Ren-Shuo},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/lo2023iclr-block/}
}