Understanding the Robustness in Vision Transformers
Abstract
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, an explanatory framework for a more systematic understanding is still lacking. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of self-attention in visual grouping, which indicate that self-attention could promote improved mid-level representations and robustness. We thus propose a family of fully attentional networks (FANs) that incorporate self-attention in both token mixing and channel processing. We validate the design comprehensively on various hierarchical backbones. Our model with a DeiT architecture achieves a state-of-the-art 47.6% mCE on ImageNet-C with 29M parameters. We also demonstrate significantly improved robustness in two downstream tasks: semantic segmentation and object detection.
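As a rough illustration of the design described above, here is a minimal sketch of a fully attentional block, assuming a PyTorch-style implementation: standard multi-head self-attention handles token mixing, and an attention-style channel reweighting (a simplified stand-in for the paper's channel processing) replaces the plain MLP. The class and parameter names (`FANBlockSketch`, `mlp_ratio`, etc.) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class FANBlockSketch(nn.Module):
    """Minimal sketch of a fully attentional block (hypothetical, not the
    authors' code): self-attention for token mixing, plus an attention-style
    channel reweighting in the channel-processing step."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixing: standard multi-head self-attention across tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)        # token mixing
        x = x + h                        # residual connection
        h = self.act(self.fc1(self.norm2(x)))
        # Channel processing: reweight hidden channels with softmax scores
        # aggregated over tokens (a simplified stand-in for channel attention).
        w = torch.softmax(h.mean(dim=1, keepdim=True), dim=-1)  # (batch, 1, hidden)
        h = self.fc2(h * w)
        return x + h                     # residual connection


# Usage: 196 tokens of dimension 384, e.g. 14x14 patches of a 224x224 image.
block = FANBlockSketch(dim=384)
out = block(torch.randn(2, 196, 384))   # shape: (2, 196, 384)
```

The key departure from a standard transformer block is the softmax reweighting inside the channel-processing branch, which lets channel importance adapt per input rather than being fixed by the MLP weights alone.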
Cite
Text
Zhou et al. "Understanding the Robustness in Vision Transformers." International Conference on Machine Learning, 2022.
Markdown
[Zhou et al. "Understanding the Robustness in Vision Transformers." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/zhou2022icml-understanding/)
BibTeX
@inproceedings{zhou2022icml-understanding,
title = {{Understanding the Robustness in Vision Transformers}},
author = {Zhou, Daquan and Yu, Zhiding and Xie, Enze and Xiao, Chaowei and Anandkumar, Animashree and Feng, Jiashi and Alvarez, Jose M.},
booktitle = {International Conference on Machine Learning},
year = {2022},
pages = {27378--27394},
volume = {162},
url = {https://mlanthology.org/icml/2022/zhou2022icml-understanding/}
}