Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Abstract

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% to 50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection from a single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained patchify stem, and its intermediate features can naturally serve as the higher-resolution inputs of a feature pyramid network without further upsampling or other manipulations, while the pre-trained ViT is regarded only as the third stage of our detector's backbone instead of the whole feature extractor. This naturally results in a ConvNet-ViT hybrid architecture. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform leading hierarchical architectures such as Swin Transformer, MViTv2 and ConvNeXt on COCO object detection and instance segmentation, and achieves better results than the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8x faster. Code and pre-trained models are available at https://github.com/hustvl/MIMDet.
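To make observation (i) concrete, the following is a minimal sketch of randomly sampling a subset of patch embeddings before they enter the encoder. It is not taken from the MIMDet code base: the function name `sample_visible_tokens`, the patch size of 16, and the uniform sampling without replacement are all assumptions made for illustration only.

```python
import random


def sample_visible_tokens(height, width, patch_size=16, keep_ratio=0.5, seed=None):
    """Pick a random subset of patch-token indices to feed to the encoder.

    Hypothetical helper for illustration. Assumes a square patch of
    `patch_size` pixels and uniform sampling without replacement.
    Returns the sorted kept indices and the (rows, cols) token grid shape.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of patch tokens along each spatial axis.
    grid_h, grid_w = height // patch_size, width // patch_size
    num_tokens = grid_h * grid_w

    # Keep at least one token so the encoder input is never empty.
    num_keep = max(1, int(num_tokens * keep_ratio))

    rng = random.Random(seed)
    kept = sorted(rng.sample(range(num_tokens), num_keep))
    return kept, (grid_h, grid_w)


if __name__ == "__main__":
    for ratio in (0.25, 0.5):
        kept, grid = sample_visible_tokens(224, 224, keep_ratio=ratio, seed=0)
        print(f"grid={grid}, keep_ratio={ratio}, tokens kept={len(kept)}")
```

Under these assumptions, a 224x224 input gives a 14x14 grid of 196 tokens, so the encoder would see 49 tokens at a 25% ratio and 98 at 50%. Because self-attention cost grows with the number of tokens, feeding only this subset is where the efficiency gain would come from.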

Cite

Text

Fang et al. "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00574

Markdown

[Fang et al. "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/fang2023iccv-unleashing/) doi:10.1109/ICCV51070.2023.00574

BibTeX

@inproceedings{fang2023iccv-unleashing,
  title     = {{Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection}},
  author    = {Fang, Yuxin and Yang, Shusheng and Wang, Shijie and Ge, Yixiao and Shan, Ying and Wang, Xinggang},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {6244--6253},
  doi       = {10.1109/ICCV51070.2023.00574},
  url       = {https://mlanthology.org/iccv/2023/fang2023iccv-unleashing/}
}