Understanding the Robustness in Vision Transformers

Abstract

Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, an explanatory framework for a more systematic understanding is still lacking. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of self-attention in visual grouping, which indicate that self-attention could promote improved mid-level representations and robustness. We thus propose a family of fully attentional networks (FANs) that incorporate self-attention in both token mixing and channel processing. We validate the design comprehensively on various hierarchical backbones. Our model with a DeiT architecture achieves a state-of-the-art 47.6% mCE on ImageNet-C with 29M parameters. We also demonstrate significantly improved robustness in two downstream tasks: semantic segmentation and object detection.
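
The core design named in the abstract, self-attention applied to both token mixing and channel processing, can be sketched in a few lines. The PyTorch block below is a minimal illustration of that idea under our own assumptions; the ChannelAttention formulation, the 1/sqrt(N) scaling, and all dimensions are illustrative and do not reproduce the authors' released FAN module.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Attention computed over the channel dimension, so channels attend
    to one another based on token content. Simplified, illustrative
    formulation; not the exact FAN channel-processing module."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # each (B, N, C)
        # Transpose so the attention map is C x C (channel-to-channel).
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / (N ** 0.5)     # (B, C, C)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2)  # (B, N, C)
        return self.proj(out)

class FANBlockSketch(nn.Module):
    """One fully-attentional block: standard self-attention mixes tokens,
    then channel attention replaces the usual per-token MLP."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mix = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mix = ChannelAttention(dim)

    def forward(self, x):                       # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.token_mix(h, h, h, need_weights=False)[0]
        x = x + self.channel_mix(self.norm2(x))
        return x

# Quick shape check on random tokens (batch, tokens, dim).
tokens = torch.randn(2, 197, 384)
print(FANBlockSketch(384)(tokens).shape)        # torch.Size([2, 197, 384])

The point of the swap is that channel interactions become content-dependent rather than fixed by static MLP weights, which is the property the abstract associates with stronger mid-level representations.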

Cite

Text

Zhou et al. "Understanding the Robustness in Vision Transformers." International Conference on Machine Learning, 2022.

Markdown

[Zhou et al. "Understanding the Robustness in Vision Transformers." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/zhou2022icml-understanding/)

BibTeX

@inproceedings{zhou2022icml-understanding,
  title     = {{Understanding the Robustness in Vision Transformers}},
  author    = {Zhou, Daquan and Yu, Zhiding and Xie, Enze and Xiao, Chaowei and Anandkumar, Animashree and Feng, Jiashi and Alvarez, Jose M.},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {27378--27394},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/zhou2022icml-understanding/}
}