Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
Abstract
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains remarkable performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
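To make the Multiway Transformer described in the abstract concrete, below is a minimal PyTorch sketch of one block: the self-attention parameters are shared across modalities (enabling deep fusion), while each token is routed to a modality-specific feed-forward expert (modality-specific encoding). This is an illustrative sketch based only on the abstract's description, not the authors' released code; the class name `MultiwayBlock`, its arguments, and the two-expert routing are all assumptions.

```python
# Hypothetical sketch of a Multiway Transformer block: shared self-attention
# plus per-modality feed-forward experts. Illustrative only, not BEiT-3's code.
import torch
import torch.nn as nn


class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared self-attention lets image and text tokens attend to each other.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Modality-specific experts: index 0 = vision FFN, index 1 = language FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim),
                nn.GELU(),
                nn.Linear(mlp_ratio * dim, dim),
            )
            for _ in range(2)
        )

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality: (batch, seq), 0 = image token, 1 = text token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        # Route each token through the feed-forward expert of its modality.
        for idx, expert in enumerate(self.experts):
            mask = modality == idx
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out
```

Under this reading, a pure-image or pure-text batch uses only one expert, while an interleaved image-text sequence is fused in the shared attention and split again at the expert layer, which matches the abstract's claim that the modular architecture supports both deep fusion and modality-specific encoding.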
Cite
Text
Wang et al. "Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01838
Markdown
[Wang et al. "Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/wang2023cvpr-image/) doi:10.1109/CVPR52729.2023.01838
BibTeX
@inproceedings{wang2023cvpr-image,
title = {{Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks}},
author = {Wang, Wenhui and Bao, Hangbo and Dong, Li and Bjorck, Johan and Peng, Zhiliang and Liu, Qiang and Aggarwal, Kriti and Mohammed, Owais Khan and Singhal, Saksham and Som, Subhojit and Wei, Furu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {19175--19186},
doi = {10.1109/CVPR52729.2023.01838},
url = {https://mlanthology.org/cvpr/2023/wang2023cvpr-image/}
}