Post-Training Sparsity-Aware Quantization

Shomron, Gil; Gabbay, Freddy; Kurzum, Samer; Weiser, Uri

NeurIPS 2021

Abstract

Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and require neither extensive hardware resources nor a training set. Mapping FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, because the increase in quantization noise makes the accuracy degradation noticeable. In this paper, we propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged at different representation granularities. For example, 4-bit quantization is performed by dynamically examining the bits of each 8-bit value and choosing a window of 4 bits, after first skipping leading zero-value bits. Moreover, instead of quantizing activation-by-activation to 4 bits, we consider pairs of 8-bit activations and check whether one of the two is equal to zero. If one is equal to zero, the other can opportunistically use its partner's 4-bit budget and keep all 8 bits; if neither is zero, each is dynamically quantized to 4 bits, as described. SPARQ achieves minor accuracy degradation and has a practical hardware implementation.
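The two ideas in the abstract (picking a 4-bit window inside an 8-bit value, and letting a nonzero activation borrow the budget of a zero partner) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `window_quantize` and `sparq_pair` are hypothetical, the window is assumed to start at the most significant set bit, and the bits below the window are simply truncated, whereas a real design may round or restrict where the window can be placed.

```python
from typing import Tuple


def window_quantize(x: int, bits: int = 4, total: int = 8) -> int:
    """Keep a `bits`-wide window of an unsigned `total`-bit value.

    Assumption: the window starts at the most significant set bit
    (leading zeros are skipped) and lower bits are truncated, not rounded.
    """
    if not 0 <= x < (1 << total):
        raise ValueError("x must fit in an unsigned `total`-bit integer")
    # Number of low-order bits that fall outside the window.
    shift = max(x.bit_length() - bits, 0)
    # Drop those bits, then restore the original magnitude.
    return (x >> shift) << shift


def sparq_pair(a: int, b: int) -> Tuple[int, int]:
    """Quantize a pair of 8-bit activations under a shared 8-bit budget.

    If either activation is zero, the other keeps its full 8 bits;
    otherwise each one is reduced to a 4-bit window.
    """
    if a == 0 or b == 0:
        return a, b
    return window_quantize(a), window_quantize(b)


if __name__ == "__main__":
    print(sparq_pair(0b10110111, 0))           # partner is zero: value kept exactly
    print(sparq_pair(0b10110111, 0b00011011))  # both nonzero: each windowed to 4 bits
```

In this sketch, small values such as `0b00001001` pass through unchanged because their significant bits already fit in the window, which is where skipping leading zeros pays off relative to a fixed 4-bit truncation.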

Cite

Text

Shomron et al. "Post-Training Sparsity-Aware Quantization." Neural Information Processing Systems, 2021.

Markdown

[Shomron et al. "Post-Training Sparsity-Aware Quantization." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/shomron2021neurips-posttraining/)

BibTeX

@inproceedings{shomron2021neurips-posttraining,
  title     = {{Post-Training Sparsity-Aware Quantization}},
  author    = {Shomron, Gil and Gabbay, Freddy and Kurzum, Samer and Weiser, Uri},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/shomron2021neurips-posttraining/}
}