ScalableViT: Rethinking the Context-Oriented Generalization of Vision Transformer

Abstract

The vanilla self-attention mechanism inherently relies on pre-defined and fixed computational dimensions. Such inflexibility prevents it from achieving context-oriented generalization, which could bring in more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release the dimensions of the query, key, and value matrices while decoupling them from the input. This scalability yields context-oriented generalization and enhances object sensitivity, pushing the whole network into a more effective trade-off between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance on general-purpose vision tasks. For example, ScalableViT-S outperforms Twins-SVT-S by 1.4% and Swin-T by 1.8% on ImageNet-1K classification.
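
To make the role of the two scaling factors concrete, below is a minimal PyTorch sketch of a scalable attention layer in the spirit of SSA. The class name, the strided-convolution spatial reduction, and the hyper-parameters r_n (spatial scaling) and r_c (channel scaling) are illustrative assumptions rather than the authors' implementation; the sketch only shows how the key/value dimensions can be decoupled from the input resolution.

import torch
import torch.nn as nn

class ScalableSelfAttention(nn.Module):
    """Sketch of SSA-style attention: r_n shrinks the number of key/value
    tokens and r_c rescales their channel width, so the attention cost no
    longer tracks the input size directly (names are hypothetical)."""
    def __init__(self, dim, num_heads=4, r_n=0.25, r_c=0.5):
        super().__init__()
        self.num_heads = num_heads
        # channel dimension of Q/K/V after applying the channel scaling factor r_c
        self.kv_dim = max(num_heads, int(dim * r_c) // num_heads * num_heads)
        self.scale = (self.kv_dim // num_heads) ** -0.5
        stride = max(1, int(round((1 / r_n) ** 0.5)))      # per-axis spatial reduction
        self.q = nn.Linear(dim, self.kv_dim)
        # strided conv reduces the H*W key/value tokens by roughly a factor of r_n
        self.sr = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride)
        self.kv = nn.Linear(dim, self.kv_dim * 2)
        self.proj = nn.Linear(self.kv_dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                                  # tokens of an H x W feature map
        q = self.q(x).reshape(B, N, self.num_heads, -1).transpose(1, 2)
        ctx = self.sr(x.transpose(1, 2).reshape(B, C, H, W))
        ctx = ctx.flatten(2).transpose(1, 2)               # (B, N', C) with N' ~ r_n * N
        k, v = self.kv(ctx).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, q.shape[-1]).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, q.shape[-1]).transpose(1, 2)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, self.kv_dim)
        return self.proj(out)                              # project back to the input width

# Example: a 14x14 feature map with 64 channels, flattened to 196 tokens.
ssa = ScalableSelfAttention(dim=64)
y = ssa(torch.randn(2, 14 * 14, 64), H=14, W=14)           # -> shape (2, 196, 64)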

Cite

Text

Yang et al. "ScalableViT: Rethinking the Context-Oriented Generalization of Vision Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20053-3_28

Markdown

[Yang et al. "ScalableViT: Rethinking the Context-Oriented Generalization of Vision Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/yang2022eccv-scalablevit/) doi:10.1007/978-3-031-20053-3_28

BibTeX

@inproceedings{yang2022eccv-scalablevit,
  title     = {{ScalableViT: Rethinking the Context-Oriented Generalization of Vision Transformer}},
  author    = {Yang, Rui and Ma, Hailong and Wu, Jie and Tang, Yansong and Xiao, Xuefeng and Zheng, Min and Li, Xiu},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-20053-3_28},
  url       = {https://mlanthology.org/eccv/2022/yang2022eccv-scalablevit/}
}