Beyond Fixation: Dynamic Window Visual Transformer
Abstract
Recently, a surge of interest in vision transformers has focused on reducing computational cost by limiting the calculation of self-attention to a local window. Most current work uses a fixed single-scale window by default, ignoring the impact of window size on model performance. However, this may limit the modeling potential of these window-based models for multi-scale information. In this paper, we propose a novel method, named Dynamic Window Vision Transformer (DW-ViT). To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention. Then, the information is dynamically fused by assigning different weights to the multi-scale window branches. We conducted a detailed performance evaluation on three datasets: ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformer, DW-ViT achieves consistent and substantial improvements on all three datasets with similar parameters and computational costs. In addition, DW-ViT exhibits good scalability and can be easily inserted into any window-based vision transformer.
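The core mechanism described above can be sketched in a few lines: channels are split into head groups, each group runs self-attention within non-overlapping windows of its own size, and the branch outputs are fused with softmax-normalized weights. This is a minimal NumPy illustration, not the paper's implementation; the identity q/k/v projections and the per-branch weighting scheme (`dynamic_window_block`, `logits`) are simplifying assumptions.

```python
import numpy as np

def window_attention(x, win):
    """Toy single-head self-attention within non-overlapping win x win windows.

    x: (H, W, C) feature map with H and W divisible by win.
    Assumption: q, k, v use identity projections to keep the sketch minimal.
    """
    H, W, C = x.shape
    out = np.zeros_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            patch = x[i:i + win, j:j + win].reshape(-1, C)   # (win*win, C)
            scores = patch @ patch.T / np.sqrt(C)
            attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
            attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
            out[i:i + win, j:j + win] = (attn @ patch).reshape(win, win, C)
    return out

def dynamic_window_block(x, window_sizes, logits):
    """Split channels into one head group per window size, attend per group,
    then fuse the multi-scale branches with softmax weights (hypothetical
    stand-in for the paper's dynamic fusion)."""
    groups = np.split(x, len(window_sizes), axis=-1)
    branches = [window_attention(g, w) for g, w in zip(groups, window_sizes)]
    weights = np.exp(logits) / np.exp(logits).sum()           # dynamic weights
    return np.concatenate([w * b for w, b in zip(weights, branches)], axis=-1)
```

For example, `dynamic_window_block(x, [2, 4], logits)` on an 8x8 feature map gives half the heads a 2x2 window and half a 4x4 window, with `logits` (learned in the real model) controlling how much each scale contributes.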
Cite
Text
Ren et al. "Beyond Fixation: Dynamic Window Visual Transformer." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01168

Markdown
[Ren et al. "Beyond Fixation: Dynamic Window Visual Transformer." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/ren2022cvpr-beyond/) doi:10.1109/CVPR52688.2022.01168

BibTeX
@inproceedings{ren2022cvpr-beyond,
title = {{Beyond Fixation: Dynamic Window Visual Transformer}},
author = {Ren, Pengzhen and Li, Changlin and Wang, Guangrun and Xiao, Yun and Du, Qing and Liang, Xiaodan and Chang, Xiaojun},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {11987--11997},
doi = {10.1109/CVPR52688.2022.01168},
url = {https://mlanthology.org/cvpr/2022/ren2022cvpr-beyond/}
}