Simulated Quantization, Real Power Savings

Abstract

Reduced-precision hardware matrix multiplication accelerators are commonly employed to reduce the power consumption of neural network inference. The multiplier designs used in such accelerators have an interesting property: when the same bit is 0 for two consecutive compute cycles, the multiplier consumes less power. In this paper we show that this effect can be exploited to reduce the power consumption of neural networks by simulating low bit-width quantization on higher bit-width hardware. We show that simulating 4-bit quantization on 8-bit hardware can yield up to a 17% relative reduction in power consumption on commonly used networks. Furthermore, we show that, in this context, bit operations (BOPs) are a good proxy for power efficiency, and that learning mixed-precision configurations that target lower BOPs can achieve better trade-offs between accuracy and power efficiency.
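
The sketch below is an illustrative interpretation of the abstract, not the authors' code: values are fake-quantized to a 4-bit grid but stored and multiplied as ordinary int8 operands, so most of the high bits of each 8-bit word stay constant between compute cycles, which is the toggling reduction the paper exploits. The helper names (`quantize_to_bits`, `bops`), the symmetric scaling choice, and the BOPs formula (MACs × weight bit-width × activation bit-width) are assumptions for illustration only.

```python
import numpy as np

def quantize_to_bits(x, num_bits, x_max):
    """Uniform symmetric fake-quantization of x to `num_bits` signed levels.

    The result is returned as int8: e.g. 4-bit values occupy only the low
    bits of each 8-bit word, so the high bits rarely toggle from one
    compute cycle to the next (illustrative assumption of the mechanism).
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = x_max / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def bops(macs, weight_bits, act_bits):
    """Bit operations: MACs x weight bit-width x activation bit-width."""
    return macs * weight_bits * act_bits

# Example: int8 hardware, but weights and activations restricted to a 4-bit grid.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
a = rng.normal(size=(64, 64)).astype(np.float32)

w_q, w_scale = quantize_to_bits(w, num_bits=4, x_max=np.abs(w).max())
a_q, a_scale = quantize_to_bits(a, num_bits=4, x_max=np.abs(a).max())

# The int8 matrix multiply an accelerator would execute; accumulate in int32.
acc = w_q.astype(np.int32) @ a_q.astype(np.int32)
y = acc * (w_scale * a_scale)                 # dequantize the result

print("BOPs, 8-bit weights/activations:", bops(macs=64 * 64 * 64, weight_bits=8, act_bits=8))
print("BOPs, 4-bit weights/activations:", bops(macs=64 * 64 * 64, weight_bits=4, act_bits=4))
```

Under this reading, the accelerator still performs an 8-bit multiplication, but the 4-bit grid keeps the unused operand bits largely static, and the BOPs count drops with the simulated bit-widths, which is why BOPs serve as the power proxy discussed in the abstract.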

Cite

Text

van Baalen et al. "Simulated Quantization, Real Power Savings." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00311

Markdown

[van Baalen et al. "Simulated Quantization, Real Power Savings." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/vanbaalen2022cvprw-simulated/) doi:10.1109/CVPRW56347.2022.00311

BibTeX

@inproceedings{vanbaalen2022cvprw-simulated,
  title     = {{Simulated Quantization, Real Power Savings}},
  author    = {van Baalen, Mart and Kahne, Brian and Mahurin, Eric and Kuzmin, Andrey and Skliar, Andrii and Nagel, Markus and Blankevoort, Tijmen},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2022},
  pages     = {2756--2760},
  doi       = {10.1109/CVPRW56347.2022.00311},
  url       = {https://mlanthology.org/cvprw/2022/vanbaalen2022cvprw-simulated/}
}