Visual Transformers: Where Do Transformers Really Belong in Vision Models?

Abstract

A recent trend in computer vision is to replace convolutions with transformers. However, the performance gain of transformers is attained at a steep cost, requiring GPU years and hundreds of millions of samples for training. This excessive resource usage compensates for a misuse of transformers: Transformers densely model relationships between their inputs -- ideal for late stages of a neural network, when concepts are sparse and spatially distant, but extremely inefficient for early stages of a network, when patterns are redundant and localized. To address these issues, we leverage the respective strengths of both operations, building convolution-transformer hybrids. Critically, in sharp contrast to pixel-space transformers, our Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context. Our VTs significantly outperform baselines: On ImageNet, our VT-ResNets outperform convolution-only ResNet by 4.6 to 7 points and transformer-only ViT-B by 2.6 points with 2.5 times fewer FLOPs and 2.1 times fewer parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
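
The efficiency argument in the abstract rests on dense self-attention scoring every pair of inputs, so its cost grows quadratically with the number of inputs. The following is a rough sketch of that arithmetic only, not the authors' implementation. The feature-map sizes and the token count are hypothetical values chosen for illustration, and `pairwise_interactions` is a helper name introduced here.

```python
def pairwise_interactions(num_inputs: int) -> int:
    """Count the query-key pairs that dense self-attention scores."""
    return num_inputs * num_inputs


# Hypothetical sizes for illustration only (assumptions, not from the abstract).
early_pixels = 56 * 56  # early-stage feature map, attended pixel by pixel
late_pixels = 14 * 14   # late-stage feature map, attended pixel by pixel
num_tokens = 16         # an assumed small set of semantic tokens

early_cost = pairwise_interactions(early_pixels)
late_cost = pairwise_interactions(late_pixels)
token_cost = pairwise_interactions(num_tokens)

print(f"early-stage pixels: {early_cost:,} pairs")
print(f"late-stage pixels:  {late_cost:,} pairs")
print(f"semantic tokens:    {token_cost:,} pairs")
```

Under these assumed sizes, attending over a handful of tokens involves several orders of magnitude fewer pairs than attending over early-stage pixels, which is the intuition behind placing transformers late in the network and operating in token space.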

Cite

Text

Wu et al. "Visual Transformers: Where Do Transformers Really Belong in Vision Models?." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00064

Markdown

[Wu et al. "Visual Transformers: Where Do Transformers Really Belong in Vision Models?." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/wu2021iccv-visual/) doi:10.1109/ICCV48922.2021.00064

BibTeX

@inproceedings{wu2021iccv-visual,
  title     = {{Visual Transformers: Where Do Transformers Really Belong in Vision Models?}},
  author    = {Wu, Bichen and Xu, Chenfeng and Dai, Xiaoliang and Wan, Alvin and Zhang, Peizhao and Yan, Zhicheng and Tomizuka, Masayoshi and Gonzalez, Joseph E. and Keutzer, Kurt and Vajda, Peter},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {599-609},
  doi       = {10.1109/ICCV48922.2021.00064},
  url       = {https://mlanthology.org/iccv/2021/wu2021iccv-visual/}
}