Block and Subword-Scaling Floating-Point (BSFP) : An Efficient Non-Uniform Quantization for Low Precision Inference
Abstract
In this paper, we propose Block and Subword-Scaling Floating-Point (BSFP), a non-uniform quantization scheme for the skewed and non-uniform distributions of weight vectors in neural networks. By quantizing each weight vector as the superposition of multiple subword vectors (in two's complement) with scaling factors (in Low-bit Floating-Point, LBFP), BSFP can effectively fit the distribution of weight vectors while maintaining high computation efficiency. Furthermore, we present a grid-search-based MSE-optimal quantization flow and a scaled serial processing engine to complete the quantization pipeline and the supporting infrastructure. Experimental results on the ImageNet classification task show that our proposed method outperforms the state-of-the-art Microsoft Floating Point (MSFP) by up to 20.56% top-1 accuracy at the same weight precision and reduces model size by up to 10.3%. Furthermore, BSFP delivers up to 2.0$\times$ the computing throughput and up to 5.3$\times$ the energy efficiency of MSFP under the same silicon area budget.
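To make the core idea concrete, the following is a minimal illustrative sketch (not the paper's exact algorithm) of approximating a weight block as a superposition of low-bit two's-complement subword vectors, each with its own scaling factor. The greedy residual fit and the float scales below are simplifying assumptions; the paper instead uses a grid-search-based MSE-optimal flow with LBFP scaling factors.

```python
import numpy as np

def bsfp_sketch(w, num_subwords=2, subword_bits=2):
    """Approximate weight block w as sum_i s_i * q_i, where each q_i is a
    low-bit two's-complement integer vector and s_i is its scaling factor.
    Greedy residual fitting is an illustrative assumption, not the paper's
    grid-search MSE-optimal flow."""
    qmin = -(1 << (subword_bits - 1))      # e.g. -2 for 2-bit two's complement
    qmax = (1 << (subword_bits - 1)) - 1   # e.g. +1
    residual = np.asarray(w, dtype=np.float64).copy()
    approx = np.zeros_like(residual)
    for _ in range(num_subwords):
        # Pick a scale so the residual's largest magnitude fits the int range.
        scale = np.max(np.abs(residual)) / max(abs(qmin), qmax)
        if scale == 0:
            break
        q = np.clip(np.round(residual / scale), qmin, qmax)  # subword vector
        approx += scale * q
        residual = np.asarray(w, dtype=np.float64) - approx
    return approx

w = np.array([0.9, -0.31, 0.07, -1.2])
w_hat = bsfp_sketch(w)
mse = float(np.mean((w - w_hat) ** 2))  # reconstruction error of the fit
```

Because each extra subword quantizes the residual left by the previous ones, the superposition can track skewed, non-uniform weight distributions that a single shared exponent (as in MSFP) fits poorly.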
Cite
Text
Lo et al. "Block and Subword-Scaling Floating-Point (BSFP) : An Efficient Non-Uniform Quantization for Low Precision Inference." International Conference on Learning Representations, 2023.
Markdown
[Lo et al. "Block and Subword-Scaling Floating-Point (BSFP) : An Efficient Non-Uniform Quantization for Low Precision Inference." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/lo2023iclr-block/)
BibTeX
@inproceedings{lo2023iclr-block,
title = {{Block and Subword-Scaling Floating-Point (BSFP) : An Efficient Non-Uniform Quantization for Low Precision Inference}},
author = {Lo, Yun-Chen and Lee, Tse-Kuang and Liu, Ren-Shuo},
booktitle = {International Conference on Learning Representations},
year = {2023},
url = {https://mlanthology.org/iclr/2023/lo2023iclr-block/}
}