CAT: Compression-Aware Training for Bandwidth Reduction

Abstract

One major obstacle hindering the ubiquitous use of CNNs for inference is their relatively high memory bandwidth requirement, which can be the primary energy consumer and throughput bottleneck in hardware accelerators. Inspired by quantization-aware training approaches, we propose a compression-aware training (CAT) method that trains the model so that its weights and feature maps compress better when the network is deployed. Our method trains the model to produce low-entropy feature maps, enabling efficient compression at inference time using classical transform coding methods. CAT significantly improves on the state-of-the-art results reported for quantization across various vision and NLP tasks, such as image classification (ImageNet), object detection (Pascal VOC), linguistic acceptability (CoLA), and textual entailment (MNLI). For example, on ResNet-18 we achieve near-baseline ImageNet accuracy with an average representation of only 1.5 bits per value with 5-bit quantization. Moreover, we show that entropy reduction of weights and activations can be applied together, further improving bandwidth reduction. A reference implementation is available.
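To make the idea concrete, below is a minimal PyTorch sketch of one way an entropy penalty on intermediate feature maps could be added to the task loss during training. It is not the authors' reference implementation: the soft-histogram entropy estimator, the hyperparameter names (lambda_ent, n_bits, temperature), and the assumption that the model returns its intermediate feature map alongside the logits are all illustrative choices.

import torch
import torch.nn.functional as F

def soft_entropy(x, n_bits=5, temperature=10.0):
    # Differentiable estimate (in bits per value) of the entropy of activations
    # uniformly quantized to 2**n_bits levels, using a soft assignment of each
    # value to the quantization bins. Illustrative proxy, not the paper's exact estimator.
    x = x.flatten()
    lo, hi = float(x.min()), float(x.max())          # fixed bin range (no gradient)
    centers = torch.linspace(lo, hi, 2 ** n_bits, device=x.device)
    logits = -temperature * (x.unsqueeze(1) - centers.unsqueeze(0)).abs()
    assign = F.softmax(logits, dim=1)                # soft one-hot bin assignment, (N, 2**n_bits)
    p = assign.mean(dim=0) + 1e-12                   # empirical bin probabilities
    return -(p * torch.log2(p)).sum()                # entropy in bits

def training_step(model, images, labels, lambda_ent=0.05):
    # Assumes model(images) returns (logits, intermediate_feature_map); in practice
    # the feature map could instead be captured with a forward hook.
    logits, feats = model(images)
    task_loss = F.cross_entropy(logits, labels)
    return task_loss + lambda_ent * soft_entropy(feats, n_bits=5)

At inference time, the low-entropy quantized activations can then be entropy-coded (e.g., with a Huffman or arithmetic coder) before being written to off-chip memory, which is how the average number of bits transferred per value can fall well below the nominal 5-bit quantization width.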

Cite

Text

Baskin et al. "CAT: Compression-Aware Training for Bandwidth Reduction." Journal of Machine Learning Research, 2021.

Markdown

[Baskin et al. "CAT: Compression-Aware Training for Bandwidth Reduction." Journal of Machine Learning Research, 2021.](https://mlanthology.org/jmlr/2021/baskin2021jmlr-cat/)

BibTeX

@article{baskin2021jmlr-cat,
  title     = {{CAT: Compression-Aware Training for Bandwidth Reduction}},
  author    = {Baskin, Chaim and Chmiel, Brian and Zheltonozhskii, Evgenii and Banner, Ron and Bronstein, Alex M. and Mendelson, Avi},
  journal   = {Journal of Machine Learning Research},
  year      = {2021},
  pages     = {1--20},
  volume    = {22},
  url       = {https://mlanthology.org/jmlr/2021/baskin2021jmlr-cat/}
}