Leveraging per Image-Token Consistency for Vision-Language Pre-Training

Abstract

Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked tokens and cannot simultaneously leverage the other tokens to learn vision-language associations. To handle these limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure); the model is then required to determine, for each token in the sentence, whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with existing pre-training methods. Extensive experiments show that combining EPIC with state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. Our code is released at https://github.com/gyhdog99/epic
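
The three steps named in the abstract (saliency-based masking, inconsistent token generation, per-token consistency prediction) can be illustrated with a toy sketch. This is not the authors' implementation: the saliency scores, the `alternatives` lookup standing in for a language model, the `mask_ratio` value, the label convention (1 = inconsistent with the image), and the binary cross-entropy loss are all assumptions made for illustration.

```python
import math
import random


def corrupt_sentence(tokens, saliency, alternatives, mask_ratio=0.3, rng=None):
    """Replace the most image-salient tokens and return per-token labels.

    `saliency` holds one hypothetical image-relevance score per token.
    `alternatives` is a stand-in for sampling from a language model.
    Labels: 1 = token was replaced by a different token, 0 = unchanged.
    """
    rng = rng or random.Random(0)
    k = max(1, round(mask_ratio * len(tokens)))
    # Step 1 (saliency-based masking): pick the k highest-scoring positions.
    salient = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)[:k]
    corrupted = list(tokens)
    labels = [0] * len(tokens)
    for i in salient:
        # Step 2 (inconsistent token generation): sample a replacement.
        candidates = alternatives.get(tokens[i], [tokens[i]])
        corrupted[i] = rng.choice(candidates)
        # A sampled token identical to the original stays labeled consistent.
        labels[i] = int(corrupted[i] != tokens[i])
    return corrupted, labels


def consistency_loss(probs, labels):
    """Step 3: mean binary cross-entropy over every token in the sentence.

    `probs[i]` is the predicted probability that token i is inconsistent.
    """
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(labels)


tokens = ["a", "dog", "chases", "a", "red", "ball"]
saliency = [0.01, 0.9, 0.3, 0.01, 0.6, 0.8]        # hypothetical scores
alternatives = {"dog": ["cat"], "ball": ["kite"]}  # hypothetical LM samples

corrupted, labels = corrupt_sentence(tokens, saliency, alternatives)
loss = consistency_loss([0.5] * len(tokens), labels)
print(corrupted, labels, loss)
```

Because the loss is computed at every position, unmasked tokens also contribute a training signal, which is the point the abstract raises against CMLM.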

Cite

Text

Gou et al. "Leveraging per Image-Token Consistency for Vision-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01836

Markdown

[Gou et al. "Leveraging per Image-Token Consistency for Vision-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/gou2023cvpr-leveraging/) doi:10.1109/CVPR52729.2023.01836

BibTeX

@inproceedings{gou2023cvpr-leveraging,
  title     = {{Leveraging per Image-Token Consistency for Vision-Language Pre-Training}},
  author    = {Gou, Yunhao and Ko, Tom and Yang, Hansi and Kwok, James and Zhang, Yu and Wang, Mingxuan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {19155--19164},
  doi       = {10.1109/CVPR52729.2023.01836},
  url       = {https://mlanthology.org/cvpr/2023/gou2023cvpr-leveraging/}
}