Fused-Layer CNNs for Memory-Efficient Inference on Microcontrollers

Abstract

Convolutional Neural Networks (CNNs) have established themselves as the dominant approach to computer vision tasks. As a result, efficient inference of CNNs has become a major concern, as it enables image data to be processed close to where it is generated by camera sensors, most commonly on microcontroller units (MCUs). However, the strict memory and bandwidth constraints of MCUs are major obstacles to deploying CNNs on them, making the processing of high-resolution images infeasible on many devices. In this work, we propose a method to fuse convolutional layers in quantized CNNs, which serves as an additional dimension for optimizing the memory requirements of CNNs during inference. By fusing memory-intensive convolutions in the early inverted residual blocks of MobileNetv2-like CNNs, we show that memory requirements during inference can be reduced by up to 54% at the cost of only about a 14% increase in latency, with no change in accuracy. As an example, we show that this reduction enables the deployment of image-processing pipelines supporting resolutions of up to 320×320 pixels on a Cortex-M7 MCU, compared to the 128×128-pixel resolution commonly used in related work.
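
To make the idea concrete, below is a minimal PyTorch sketch of fused-layer evaluation for a chain of two 3×3, stride-1 convolutions. This is a floating-point simplification, not the authors' quantized MCU implementation: the second layer's output is computed tile by tile with a 2-pixel input halo, so only a tile-sized slice of the intermediate feature map is ever materialized instead of the full-resolution tensor. The function name, tile size, and layer shapes are illustrative assumptions.

# Minimal sketch of fused-layer evaluation for two 3x3 convolutions
# (illustrative only, not the authors' quantized MCU kernels): the
# second layer's output is computed tile by tile, so only a small tile
# of the intermediate feature map is held in memory at any time.
import torch
import torch.nn.functional as F

def fused_two_convs(x, w1, w2, tile=32):
    # Both convolutions: 3x3 kernel, stride 1, padding 1 ("same" size).
    n, _, H, W = x.shape
    y = torch.empty(n, w2.shape[0], H, W, dtype=x.dtype)
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            i1, j1 = min(i + tile, H), min(j + tile, W)
            # Each 3x3 conv needs a 1-pixel halo; fusing two needs 2.
            hi, hj = max(i - 2, 0), max(j - 2, 0)
            hi1, hj1 = min(i1 + 2, H), min(j1 + 2, W)
            # First conv + ReLU on the enlarged input tile; the
            # intermediate stays tile-sized and is discarded afterwards.
            t = F.relu(F.conv2d(x[:, :, hi:hi1, hj:hj1], w1, padding=1))
            # Second conv, then crop the halo away to the true output tile.
            o = F.conv2d(t, w2, padding=1)
            y[:, :, i:i1, j:j1] = o[:, :, i - hi:i1 - hi, j - hj:j1 - hj]
    return y

# Sanity check against the unfused computation:
x = torch.randn(1, 3, 128, 128)
w1, w2 = torch.randn(8, 3, 3, 3), torch.randn(16, 8, 3, 3)
ref = F.conv2d(F.relu(F.conv2d(x, w1, padding=1)), w2, padding=1)
assert torch.allclose(fused_two_convs(x, w1, w2), ref, atol=1e-4)

The sketch also makes the trade-off visible: halo regions are recomputed by neighboring tiles, and this redundant work grows with the number of fused layers and their receptive fields, which is consistent with the modest latency increase reported in the abstract.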

Cite

Text

Deutel et al. "Fused-Layer CNNs for Memory-Efficient Inference on Microcontrollers." NeurIPS 2024 Workshops: Compression, 2024.

Markdown

[Deutel et al. "Fused-Layer CNNs for Memory-Efficient Inference on Microcontrollers." NeurIPS 2024 Workshops: Compression, 2024.](https://mlanthology.org/neuripsw/2024/deutel2024neuripsw-fusedlayer/)

BibTeX

@inproceedings{deutel2024neuripsw-fusedlayer,
  title     = {{Fused-Layer CNNs for Memory-Efficient Inference on Microcontrollers}},
  author    = {Deutel, Mark and Hannig, Frank and Mutschler, Christopher and Teich, Jürgen},
  booktitle = {NeurIPS 2024 Workshops: Compression},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/deutel2024neuripsw-fusedlayer/}
}