HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer

Abstract

There has been a debate on the choice between plain and hierarchical vision transformers: researchers often believe that the former (e.g., ViT) has a simpler design while the latter (e.g., Swin) enjoys higher recognition accuracy. Recently, the emergence of masked image modeling (MIM), a self-supervised visual pre-training method, has raised a new challenge to vision transformers in terms of flexibility, i.e., a portion of image patches or tokens is discarded, which seems to favor plain vision transformers. In this paper, we delve deep into the comparison between ViT and Swin, revealing that (i) the performance gain of Swin is mainly brought by a deepened backbone and relative positional encoding, (ii) the hierarchical design of Swin can be simplified into hierarchical patch embedding (proposed in this work), and (iii) other designs such as shifted-window attention can be removed. By removing the unnecessary operations, we come up with a new architecture named HiViT (short for hierarchical ViT), which is simpler and more efficient than Swin yet further improves its performance on fully-supervised and self-supervised visual representation learning. In particular, after being pre-trained with a masked autoencoder (MAE) on ImageNet-1K, HiViT-B reports an 84.6% accuracy on ImageNet-1K classification, a 53.3% box AP on COCO detection, and a 52.8% mIoU on ADE20K segmentation, significantly surpassing the baseline. Code is available at https://github.com/zhangxiaosong18/hivit.
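
To make the high-level idea in the abstract concrete, the sketch below illustrates the two ingredients it names: a hierarchical patch embedding that merges small patches into 16x16-equivalent tokens, followed by plain transformer blocks with global (non-shifted-window) attention. This is a minimal, hypothetical PyTorch-style sketch, not the official implementation; class names such as HierarchicalPatchEmbed and HiViTSketch, the use of convolutions for patch merging, and the omission of early-stage MLP blocks and relative positional encoding are simplifying assumptions. See the official repository above for the actual architecture.

```python
# Minimal sketch (NOT the official HiViT code): hierarchical patch embedding
# followed by plain, globally-attending transformer blocks. Names are hypothetical.
import torch
import torch.nn as nn


class HierarchicalPatchEmbed(nn.Module):
    """Embed 4x4 patches, then merge twice so each token covers a 16x16 region."""
    def __init__(self, in_chans=3, dims=(96, 192, 384)):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)   # 4x4 patches
        # each merge halves the spatial resolution and widens the channels
        self.merge1 = nn.Conv2d(dims[0], dims[1], kernel_size=2, stride=2)
        self.merge2 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, dims[0], H/4,  W/4)
        x = self.merge1(x)                      # (B, dims[1], H/8,  W/8)
        x = self.merge2(x)                      # (B, dims[2], H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, N, dims[2]), N = HW/256


class HiViTSketch(nn.Module):
    """Hierarchical patch embedding + plain ViT blocks (no shifted windows)."""
    def __init__(self, dim=384, depth=12, num_heads=6):
        super().__init__()
        self.patch_embed = HierarchicalPatchEmbed(dims=(dim // 4, dim // 2, dim))
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        tokens = self.patch_embed(x)            # (B, N, dim)
        tokens = self.blocks(tokens)            # global self-attention over all tokens
        return self.norm(tokens).mean(dim=1)    # pooled feature for classification


if __name__ == "__main__":
    model = HiViTSketch()
    feats = model(torch.randn(2, 3, 224, 224))
    print(feats.shape)                          # torch.Size([2, 384])
```

Because the main stage attends over all tokens without window partitioning, masked tokens can simply be dropped before the transformer blocks, which is what makes this design convenient for MIM-style pre-training such as MAE.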

Cite

Text

Zhang et al. "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer." International Conference on Learning Representations, 2023.

Markdown

[Zhang et al. "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/zhang2023iclr-hivit/)

BibTeX

@inproceedings{zhang2023iclr-hivit,
  title     = {{HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer}},
  author    = {Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/zhang2023iclr-hivit/}
}