Dense Vision Transformer Compression with Few Samples

Abstract

Few-shot model compression aims to compress a large model into a more compact one with only a tiny training set (even without labels). Block-level pruning has recently emerged as a leading technique for achieving high accuracy and low latency in few-shot CNN compression. However, few-shot compression for Vision Transformers (ViT) remains largely unexplored, and it presents a new challenge. In particular, traditional CNN few-shot methods suffer from sparse compression: they can produce only a handful of compressed models at different model sizes. This paper proposes a novel framework for few-shot ViT compression, named DC-ViT. Instead of dropping an entire block, DC-ViT selectively eliminates the attention module while retaining and reusing portions of the MLP module. DC-ViT enables dense compression, producing numerous compressed models that densely populate the range of model complexity. DC-ViT outperforms state-of-the-art few-shot compression methods by a significant margin of 10 percentage points, while also achieving lower latency, in the compression of ViT and its variants.
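
The core mechanism lends itself to a short illustration. Below is a minimal PyTorch sketch, not the authors' code, of what "eliminating the attention module while retaining and reusing portions of the MLP module" could look like for a timm-style ViT block. The names `CompressedBlock`, `compress_block`, and `keep_ratio`, as well as the first-k choice of which hidden units to keep, are assumptions made for illustration; the paper's actual selection and few-shot fine-tuning procedure is not reproduced here.

```python
# Hypothetical sketch of the DC-ViT idea from the abstract: replace a full
# transformer block (attention + MLP) with a residual MLP-only block that
# reuses a slice of the original block's pretrained MLP weights.
import torch
import torch.nn as nn


class CompressedBlock(nn.Module):
    """A ViT block with attention removed; only part of the MLP survives."""

    def __init__(self, dim: int, hidden_kept: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden_kept)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_kept, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual MLP path only; the attention sub-layer is gone.
        return x + self.fc2(self.act(self.fc1(self.norm(x))))


def compress_block(block: nn.Module, keep_ratio: float = 0.5) -> CompressedBlock:
    """Build a CompressedBlock reusing a slice of `block`'s MLP weights.

    Assumes a timm-style block exposing `.norm2`, `.mlp.fc1`, `.mlp.fc2`.
    Keeping the first k hidden units is a placeholder policy, not the
    paper's method.
    """
    dim = block.mlp.fc1.in_features
    hidden = block.mlp.fc1.out_features
    kept = max(1, int(hidden * keep_ratio))

    new = CompressedBlock(dim, kept)
    with torch.no_grad():
        new.norm.weight.copy_(block.norm2.weight)
        new.norm.bias.copy_(block.norm2.bias)
        new.fc1.weight.copy_(block.mlp.fc1.weight[:kept])
        new.fc1.bias.copy_(block.mlp.fc1.bias[:kept])
        new.fc2.weight.copy_(block.mlp.fc2.weight[:, :kept])
        new.fc2.bias.copy_(block.mlp.fc2.bias)
    return new
```

Under this reading, sweeping `keep_ratio` over a fine grid (and over different blocks) would yield many compressed models that densely cover the complexity range, which is the "dense compression" property the abstract highlights.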

Cite

Text

Zhang et al. "Dense Vision Transformer Compression with Few Samples." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01498

Markdown

[Zhang et al. "Dense Vision Transformer Compression with Few Samples." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zhang2024cvpr-dense/) doi:10.1109/CVPR52733.2024.01498

BibTeX

@inproceedings{zhang2024cvpr-dense,
  title     = {{Dense Vision Transformer Compression with Few Samples}},
  author    = {Zhang, Hanxiao and Zhou, Yifan and Wang, Guo-Hua},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {15825--15834},
  doi       = {10.1109/CVPR52733.2024.01498},
  url       = {https://mlanthology.org/cvpr/2024/zhang2024cvpr-dense/}
}