F8Net: Fixed-Point 8-Bit Only Multiplication for Network Quantization

Abstract

Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce this gap, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization. This introduces a noticeable cost in memory, speed, and required energy. To tackle these issues, we present F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication. To derive our method, we first discuss the advantages of fixed-point multiplication with different formats of fixed-point numbers and study the statistical behavior of the associated fixed-point numbers. Second, based on the statistical and algorithmic analysis, we apply different fixed-point formats for weights and activations of different layers. We introduce a novel algorithm to automatically determine the right format for each layer during training. Third, we analyze a previous quantization algorithm, parameterized clipping activation (PACT), and reformulate it using fixed-point arithmetic. Finally, we unify the recently proposed method for quantization fine-tuning with our fixed-point approach to show the potential of our method. We verify F8Net on ImageNet for MobileNet V1/V2 and ResNet18/50. Our approach achieves comparable or better performance when compared not only to existing quantization techniques with INT32 multiplication or floating-point arithmetic, but also to the full-precision counterparts, achieving state-of-the-art performance.
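
The key idea the abstract refers to is that when weights and activations are stored as 8-bit fixed-point numbers with known fractional lengths (formats), rescaling between formats reduces to an integer bit shift rather than an INT32 or floating-point multiply. The snippet below is a minimal illustrative sketch of that arithmetic, not the authors' implementation: the function names, the NumPy setup, and the specific fractional lengths are assumptions chosen for illustration.

```python
import numpy as np

def to_fixed(x, frac_bits, total_bits=8):
    """Quantize a float array to signed fixed-point with `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), qmin, qmax).astype(np.int32)

def fixed_mul(xq, wq, frac_x, frac_w, frac_out, total_bits=8):
    """Multiply two fixed-point tensors and move the result to the output format
    using only an integer shift (assumes frac_x + frac_w > frac_out)."""
    prod = xq.astype(np.int32) * wq.astype(np.int32)        # integer product
    shift = frac_x + frac_w - frac_out                       # fractional bits to drop
    out = np.right_shift(prod + (1 << (shift - 1)), shift)   # round-to-nearest via shift
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(out, qmin, qmax)

# Hypothetical example: 0.75 (6 fractional bits) times 0.5 (7 fractional bits),
# produced in an output format with 6 fractional bits.
xq = to_fixed(np.array([0.75]), frac_bits=6)
wq = to_fixed(np.array([0.5]), frac_bits=7)
yq = fixed_mul(xq, wq, frac_x=6, frac_w=7, frac_out=6)
print(yq / 2 ** 6)  # ~0.375
```

In this sketch the per-tensor fractional lengths play the role of the per-layer formats discussed in the paper; F8Net's contribution is in how those formats are determined automatically during training, which the code above does not attempt to reproduce.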

Cite

Text

Jin et al. "F8Net: Fixed-Point 8-Bit Only Multiplication for Network Quantization." International Conference on Learning Representations, 2022.

Markdown

[Jin et al. "F8Net: Fixed-Point 8-Bit Only Multiplication for Network Quantization." International Conference on Learning Representations, 2022.](https://mlanthology.org/iclr/2022/jin2022iclr-f8net/)

BibTeX

@inproceedings{jin2022iclr-f8net,
  title     = {{F8Net: Fixed-Point 8-Bit Only Multiplication for Network Quantization}},
  author    = {Jin, Qing and Ren, Jian and Zhuang, Richard and Hanumante, Sumant and Li, Zhengang and Chen, Zhiyu and Wang, Yanzhi and Yang, Kaiyuan and Tulyakov, Sergey},
  booktitle = {International Conference on Learning Representations},
  year      = {2022},
  url       = {https://mlanthology.org/iclr/2022/jin2022iclr-f8net/}
}