Rethinking the Self-Attention in Vision Transformers

Abstract

Self-attention is a cornerstone of transformer models. However, our analysis shows that self-attention in vision transformer inference is extremely sparse. When applying a sparsity constraint, our experiments on image (ImageNet-1K) and video (Kinetics-400) understanding show that we can achieve 95% sparsity on the self-attention maps while keeping the performance drop to less than 2 points. This motivates us to rethink the role of self-attention in vision transformer models.
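To make the 95% sparsity figure concrete, the sketch below illustrates one simple way to sparsify an attention map: keeping only the top 5% of weights in each query row and renormalizing. This is an illustrative assumption, not the paper's actual sparsity-constraint formulation; the function name sparsify_attention and the keep_ratio parameter are hypothetical.

import torch

def sparsify_attention(attn, keep_ratio=0.05):
    # attn: attention map of shape (batch, heads, queries, keys), rows sum to 1.
    # keep_ratio: fraction of keys retained per query; 0.05 corresponds to
    # the 95% sparsity level discussed in the abstract.
    num_keys = attn.shape[-1]
    k = max(1, int(round(keep_ratio * num_keys)))

    # Zero out everything except the top-k weights in each row.
    topk_vals, topk_idx = attn.topk(k, dim=-1)
    sparse = torch.zeros_like(attn).scatter_(-1, topk_idx, topk_vals)

    # Renormalize so each row is still a valid distribution over keys.
    return sparse / sparse.sum(dim=-1, keepdim=True).clamp_min(1e-12)

# Example: a ViT-style map with 197 tokens (1 class token + 14x14 patches).
attn = torch.softmax(torch.randn(1, 12, 197, 197), dim=-1)
sparse_attn = sparsify_attention(attn, keep_ratio=0.05)
print((sparse_attn == 0).float().mean())  # roughly 0.95 of entries are zero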

Cite

Text

Kim et al. "Rethinking the Self-Attention in Vision Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021. doi:10.1109/CVPRW53098.2021.00342

Markdown

[Kim et al. "Rethinking the Self-Attention in Vision Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.](https://mlanthology.org/cvprw/2021/kim2021cvprw-rethinking/) doi:10.1109/CVPRW53098.2021.00342

BibTeX

@inproceedings{kim2021cvprw-rethinking,
  title     = {{Rethinking the Self-Attention in Vision Transformers}},
  author    = {Kim, Kyungmin and Wu, Bichen and Dai, Xiaoliang and Zhang, Peizhao and Yan, Zhicheng and Vajda, Peter and Kim, Seon Joo},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2021},
  pages     = {3071-3075},
  doi       = {10.1109/CVPRW53098.2021.00342},
  url       = {https://mlanthology.org/cvprw/2021/kim2021cvprw-rethinking/}
}