Seeing What You Miss: Vision-Language Pre-Training with Semantic Completion Learning

Abstract

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct correspondences between modalities. To this end, inspired by the success of masked language modeling (MLM) in NLP pre-training, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interaction. The core idea of previous masked modeling tasks is to reconstruct the masked tokens from the visible context, thereby learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, which limits the cross-modal alignment ability of the global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task completes the missing semantics of masked data by capturing the corresponding information from the other modality, promoting the learning of more representative global features, which in turn strongly affect the performance of downstream tasks. Moreover, we present a flexible vision encoder that enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method achieves state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
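To make the idea of global-to-local alignment concrete, below is a minimal PyTorch-style sketch of a semantic-completion-style objective. It is an illustration under stated assumptions, not the authors' released implementation: all names (`semantic_completion_loss`, `global_masked`, `global_full`, the temperature value) are hypothetical. It assumes a fusion encoder produces a global [CLS] feature for a masked input, and the loss pulls that feature toward the global feature of the corresponding unmasked input, so the missing semantics must be recovered from the other modality.

```python
# Sketch of a semantic-completion-style objective (illustrative, not the
# paper's code). A cross-modal fusion encoder is assumed to yield a global
# [CLS] feature for a masked input; the loss aligns it with the global
# feature of the unmasked input from a stop-gradient target branch.

import torch
import torch.nn.functional as F


def semantic_completion_loss(global_masked: torch.Tensor,
                             global_full: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment between masked and unmasked global features.

    global_masked: (B, D) [CLS] features of masked inputs after cross-modal fusion.
    global_full:   (B, D) [CLS] features of the corresponding unmasked inputs.
    """
    z_m = F.normalize(global_masked, dim=-1)
    z_f = F.normalize(global_full.detach(), dim=-1)   # no gradient to the target branch
    logits = z_m @ z_f.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(z_m.size(0), device=z_m.device)
    # Each masked sample should match the full view of the same sample.
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage: random features stand in for encoder outputs.
    B, D = 8, 256
    masked_feat = torch.randn(B, D, requires_grad=True)
    full_feat = torch.randn(B, D)
    loss = semantic_completion_loss(masked_feat, full_feat)
    loss.backward()
    print(f"SCL-style loss: {loss.item():.4f}")
```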

Cite

Text

Ji et al. "Seeing What You Miss: Vision-Language Pre-Training with Semantic Completion Learning." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00656

Markdown

[Ji et al. "Seeing What You Miss: Vision-Language Pre-Training with Semantic Completion Learning." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/ji2023cvpr-seeing/) doi:10.1109/CVPR52729.2023.00656

BibTeX

@inproceedings{ji2023cvpr-seeing,
  title     = {{Seeing What You Miss: Vision-Language Pre-Training with Semantic Completion Learning}},
  author    = {Ji, Yatai and Tu, Rongcheng and Jiang, Jie and Kong, Weijie and Cai, Chengfei and Zhao, Wenzhe and Wang, Hongfa and Yang, Yujiu and Liu, Wei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {6789-6798},
  doi       = {10.1109/CVPR52729.2023.00656},
  url       = {https://mlanthology.org/cvpr/2023/ji2023cvpr-seeing/}
}