Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment

Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

CVPR 2022 pp. 16485-16494

doi:10.1109/CVPR52688.2022.01599

Abstract

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data, which is costlier to collect than image-only or text-only data. In this paper, we propose unsupervised Vision-and-Language pre-training (UVLP) to learn cross-modal representations from non-parallel image and text datasets. We find two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bridge the gap between the two modalities. A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model. We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+. Our model achieves state-of-the-art performance on all of these tasks under the unsupervised setting.
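
The retrieval step mentioned in the abstract, which pairs each image with text from a non-parallel corpus, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each image comes with a list of detected object tags and scores sentences by simple tag-token overlap, which stands in for whatever retrieval model the paper actually uses. The names `tag_overlap_score` and `retrieve_weak_pairs`, and the toy data, are hypothetical.

```python
def tag_overlap_score(tags, sentence):
    """Count how many distinct tags appear as whitespace-separated tokens in the sentence."""
    tokens = set(sentence.lower().split())
    return sum(1 for tag in set(tags) if tag.lower() in tokens)


def retrieve_weak_pairs(image_tags, sentences, top_k=1):
    """Pair each image with its top_k highest-scoring sentences.

    image_tags: dict mapping an image id to a list of object tags
                (assumed to come from an off-the-shelf detector).
    sentences:  list of sentences from an unrelated, text-only corpus.
    """
    pairs = {}
    for image_id, tags in image_tags.items():
        # sorted() is stable, so tied sentences keep their corpus order.
        ranked = sorted(
            sentences,
            key=lambda s: tag_overlap_score(tags, s),
            reverse=True,
        )
        pairs[image_id] = ranked[:top_k]
    return pairs


# Toy, made-up data for illustration only.
image_tags = {
    "img1": ["dog", "frisbee"],
    "img2": ["pizza", "table"],
}
sentences = [
    "a dog catches a frisbee in the park",
    "a pizza sits on a wooden table",
    "a man rides a horse",
]
pairs = retrieve_weak_pairs(image_tags, sentences)
```

The resulting pairs are only weakly aligned: a retrieved sentence mentions some of the same objects but does not describe the image itself, which is why the abstract follows retrieval with alignment tasks at the region-to-tag, region-to-phrase, and image-to-sentence levels.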

Cite

Text

Zhou et al. "Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01599

Markdown

[Zhou et al. "Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/zhou2022cvpr-unsupervised/) doi:10.1109/CVPR52688.2022.01599

BibTeX

@inproceedings{zhou2022cvpr-unsupervised,
  title     = {{Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment}},
  author    = {Zhou, Mingyang and Yu, Licheng and Singh, Amanpreet and Wang, Mengjiao and Yu, Zhou and Zhang, Ning},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {16485--16494},
  doi       = {10.1109/CVPR52688.2022.01599},
  url       = {https://mlanthology.org/cvpr/2022/zhou2022cvpr-unsupervised/}
}