Efficient Deep Learning Inference Based on Model Compression
Abstract
Deep neural networks (DNNs) have evolved remarkably over the last decade and achieved great success in many machine learning tasks. As deep learning (DL) methods have evolved, the computational complexity and resource consumption of DL models have continued to increase, which makes efficient deployment challenging, especially on devices with limited memory or in applications with strict latency requirements. In this paper, we introduce a DL inference optimization pipeline that consists of a series of model compression methods, including Tensor Decomposition (TD), Graph Adaptive Pruning (GAP), Intrinsic Sparse Structures (ISS) in Long Short-Term Memory (LSTM), Knowledge Distillation (KD), and low-bit model quantization. We test this inference optimization pipeline with the above methods in different modeling scenarios, and it shows promising results, making inference more efficient with only marginal loss of model accuracy.
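To give a concrete sense of one component of such a pipeline, the sketch below illustrates a standard knowledge distillation loss (soft teacher targets blended with hard labels). This is a generic PyTorch-style example, not the authors' implementation; the function name `distillation_loss` and the hyperparameters `T` and `alpha` are illustrative assumptions.

```python
# Minimal knowledge-distillation sketch (one compression technique named in the
# abstract). Not the paper's implementation; T and alpha are illustrative values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-softened teacher and
    # student distributions, scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random tensors (batch of 8, 10 classes):
# loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
#                          torch.randint(0, 10, (8,)))
```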
Cite
Text
Zhang et al. "Efficient Deep Learning Inference Based on Model Compression." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018. doi:10.1109/CVPRW.2018.00221
Markdown
[Zhang et al. "Efficient Deep Learning Inference Based on Model Compression." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.](https://mlanthology.org/cvprw/2018/zhang2018cvprw-efficient/) doi:10.1109/CVPRW.2018.00221
BibTeX
@inproceedings{zhang2018cvprw-efficient,
title = {{Efficient Deep Learning Inference Based on Model Compression}},
author = {Zhang, Qing and Zhang, Mengru and Wang, Mengdi and Sui, Wanchen and Meng, Chen and Yang, Jun and Kong, Weidan and Cui, Xiaoyuan and Lin, Wei},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2018},
pages = {1695--1702},
doi = {10.1109/CVPRW.2018.00221},
url = {https://mlanthology.org/cvprw/2018/zhang2018cvprw-efficient/}
}