UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training

Abstract

Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC^2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (e.g., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves a new state of the art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
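
The data augmentation step mentioned in the abstract (extending English-only image-caption data with machine-translated captions) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout, the `augment_captions` helper, the target languages, and the `toy_translate` stand-in for a real MT system are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def augment_captions(
    records: List[Dict[str, str]],
    languages: List[str],
    translate: Callable[[str, str], str],
) -> List[Dict[str, object]]:
    """Attach machine-translated captions to English-only image-caption records.

    `translate(text, target_lang)` is a hypothetical hook for any MT system.
    """
    augmented = []
    for rec in records:
        # Keep the original English caption and add one caption per language,
        # so every language is paired with the same image.
        captions = {"en": rec["caption"]}
        for lang in languages:
            captions[lang] = translate(rec["caption"], lang)
        augmented.append({"image_id": rec["image_id"], "captions": captions})
    return augmented


def toy_translate(text: str, target_lang: str) -> str:
    # Placeholder only: tags the text instead of translating it.
    return f"[{target_lang}] {text}"


# Illustrative record and target languages (assumed, not from the paper).
records = [{"image_id": "img_001", "caption": "A dog runs on the beach."}]
augmented = augment_captions(records, ["de", "fr"], toy_translate)
```

Each augmented record keeps a single image with parallel captions, which is the structure that lets the image act as a shared context across languages.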

Cite

Text

Zhou et al. "UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00414

Markdown

[Zhou et al. "UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/zhou2021cvpr-uc2/) doi:10.1109/CVPR46437.2021.00414

BibTeX

@inproceedings{zhou2021cvpr-uc2,
  title     = {{UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training}},
  author    = {Zhou, Mingyang and Zhou, Luowei and Wang, Shuohang and Cheng, Yu and Li, Linjie and Yu, Zhou and Liu, Jingjing},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {4155-4165},
  doi       = {10.1109/CVPR46437.2021.00414},
  url       = {https://mlanthology.org/cvpr/2021/zhou2021cvpr-uc2/}
}