Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

Abstract

This paper explores the feasibility of finding an optimal sub-model within a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework. ViT-Slim searches for a sub-structure of the original model end-to-end across multiple dimensions, including the input tokens, MHSA modules, and MLP modules, and achieves state-of-the-art performance. The method is based on a learnable, unified ℓ1 sparsity constraint with pre-defined factors that reflect global importance in the continuous search space of the different dimensions. The search is highly efficient thanks to a single-shot training scheme: on DeiT-S, for instance, it takes only 43 GPU hours, and the searched structure is flexible, with diverse dimensionalities in different modules. A budget threshold is then applied according to the accuracy-FLOPs trade-off required on the target device, and the selected structure is re-trained to obtain the final model. Extensive experiments show that ViT-Slim can compress up to 40% of the parameters and 40% of the FLOPs of various vision transformers while increasing accuracy by 0.6% on ImageNet. We also demonstrate the advantage of the searched models on several downstream datasets. Our code is available at https://github.com/Arnav0400/ViT-Slim.
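
To make the search-then-threshold idea concrete, here is a minimal sketch in plain Python. It is not taken from the ViT-Slim repository: the function names, the toy importance factors, the penalty weight, and the `keep_ratio` parameter are all assumptions chosen for illustration. It shows a sparsity penalty summed over importance factors from several dimensions, followed by a single global threshold that decides which entries of each dimension are kept.

```python
# Hypothetical sketch: all names and values are illustrative assumptions,
# not the authors' implementation.

def l1_penalty(masks, weight=1e-4):
    """Sum of absolute factor values over every searched dimension, scaled by `weight`."""
    return weight * sum(abs(v) for group in masks.values() for v in group)


def global_threshold(masks, keep_ratio):
    """Magnitude of the smallest factor that survives when `keep_ratio` of all factors are kept."""
    magnitudes = sorted((abs(v) for group in masks.values() for v in group), reverse=True)
    n_keep = max(1, round(keep_ratio * len(magnitudes)))
    return magnitudes[n_keep - 1]


def select_substructure(masks, keep_ratio):
    """Per group, return the indices whose factor magnitude meets the global threshold."""
    thr = global_threshold(masks, keep_ratio)
    return {name: [i for i, v in enumerate(group) if abs(v) >= thr]
            for name, group in masks.items()}


# Toy importance factors, standing in for values learned during the search phase.
masks = {
    "mhsa_dims": [0.9, 0.05, 0.7, 0.01],
    "mlp_dims": [0.8, 0.6, 0.02, 0.03, 0.5, 0.04],
    "patches": [0.95, 0.4],
}
kept = select_substructure(masks, keep_ratio=0.5)
```

Because the threshold is shared across all groups, each group can end up with a different number of surviving entries, which mirrors the abstract's point about diverse dimensionalities in different modules. In a real setting the penalty would be added to the training loss and the kept structure would then be re-trained.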

Cite

Text

Chavan et al. "Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00488

Markdown

[Chavan et al. "Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/chavan2022cvpr-vision/) doi:10.1109/CVPR52688.2022.00488

BibTeX

@inproceedings{chavan2022cvpr-vision,
  title     = {{Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space}},
  author    = {Chavan, Arnav and Shen, Zhiqiang and Liu, Zhuang and Liu, Zechun and Cheng, Kwang-Ting and Xing, Eric P.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {4931--4941},
  doi       = {10.1109/CVPR52688.2022.00488},
  url       = {https://mlanthology.org/cvpr/2022/chavan2022cvpr-vision/}
}