Layer-Wise Neural Network Compression via Layer Fusion
Abstract
This paper proposes layer fusion, a model compression technique that discovers which weights to combine and then fuses the weights of similar fully-connected, convolutional and attention layers. Layer fusion can significantly reduce the number of layers in the original network with little additional computational overhead, while maintaining competitive performance. From experiments on CIFAR-10, we find that various deep convolutional neural networks can remain within 2 percentage points of the original networks' accuracy up to a compression ratio of 3.33 when iteratively retrained with layer fusion. For experiments on the WikiText-2 language modelling dataset, we compress Transformer models to 20% of their original size while remaining within 5 perplexity points of the original network. We also find that other well-established compression techniques can achieve competitive performance when compared to their original networks given a sufficient number of retraining steps. Generally, we observe a clear inflection point in performance as the amount of compression increases, suggesting a bound on the amount of compression that can be achieved before an exponential degradation in performance.
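The abstract only describes fusion at a high level. As an illustration of the general idea, the sketch below groups same-shape fully-connected layers whose weights are similar and replaces each group with a single shared, averaged layer. The cosine-similarity criterion, the plain weight averaging, and the helpers `cosine_sim` and `fuse_similar_layers` are assumptions made for this sketch, not the paper's exact algorithm (which also handles convolutional and attention layers and uses iterative retraining).

```python
# Minimal sketch of layer fusion for same-shape fully-connected layers.
# ASSUMPTION: cosine similarity as the alignment signal and simple weight
# averaging as the fusion rule; the paper's actual criteria may differ.
import copy

import torch
import torch.nn as nn


def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two flattened weight tensors."""
    a, b = a.flatten(), b.flatten()
    return float(torch.dot(a, b) / (a.norm() * b.norm() + 1e-12))


def fuse_similar_layers(layers: list[nn.Linear], threshold: float = 0.9) -> list[nn.Linear]:
    """Group layers whose weights are similar to a group's first member,
    then fuse each group into one averaged layer shared at every position."""
    members: list[list[nn.Linear]] = []  # original layers in each group
    assignment: list[int] = []           # group index for each position
    for layer in layers:
        placed = False
        for g, group in enumerate(members):
            rep = group[0]
            if (layer.weight.shape == rep.weight.shape
                    and cosine_sim(layer.weight.data, rep.weight.data) >= threshold):
                group.append(layer)
                assignment.append(g)
                placed = True
                break
        if not placed:
            members.append([layer])
            assignment.append(len(members) - 1)
    # Fuse each group by averaging member weights into one shared layer.
    fused_reps = []
    for group in members:
        rep = nn.Linear(group[0].in_features, group[0].out_features)
        rep.weight.data = torch.stack([l.weight.data for l in group]).mean(0)
        rep.bias.data = torch.stack([l.bias.data for l in group]).mean(0)
        fused_reps.append(rep)
    return [fused_reps[g] for g in assignment]


# Usage: six identical hidden layers collapse into a single shared layer.
base = nn.Linear(128, 128)
layers = [copy.deepcopy(base) for _ in range(6)]
shared = fuse_similar_layers(layers, threshold=0.9)
print(len({id(l) for l in shared}), "unique layer(s) after fusion")  # -> 1
```

In this reading, compression comes from parameter sharing: the network keeps its depth, but fused positions reuse one set of weights, so the parameter count drops roughly by the compression ratio quoted in the abstract.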
Cite

Text
O’Neill et al. "Layer-Wise Neural Network Compression via Layer Fusion." Proceedings of The 13th Asian Conference on Machine Learning, 2021.

Markdown
[O’Neill et al. "Layer-Wise Neural Network Compression via Layer Fusion." Proceedings of The 13th Asian Conference on Machine Learning, 2021.](https://mlanthology.org/acml/2021/oneill2021acml-layerwise/)

BibTeX
@inproceedings{oneill2021acml-layerwise,
title = {{Layer-Wise Neural Network Compression via Layer Fusion}},
author = {O’Neill, James and Steeg, Greg V. and Galstyan, Aram},
booktitle = {Proceedings of The 13th Asian Conference on Machine Learning},
year = {2021},
pages = {1381-1396},
volume = {157},
url = {https://mlanthology.org/acml/2021/oneill2021acml-layerwise/}
}