Multi-Modal Representation Learning with Text-Driven Soft Masks

Abstract

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing the word-conditional visual attention using the multi-modal encoder. Second, we encourage the model to focus more on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which mitigates the inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples through text masking and image distortions. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
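To illustrate the focal ITC objective described above, here is a minimal sketch of how a focal weighting term can modulate a standard image-text contrastive (InfoNCE-style) loss. This is a hypothetical reconstruction, not the authors' exact formulation: the function name, the temperature, and the focusing parameter `gamma` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def focal_itc_loss(image_emb, text_emb, temperature=0.07, gamma=2.0):
    """Image-text contrastive loss with a focal weighting term (sketch).

    Easy pairs (high matching probability p) are down-weighted by
    (1 - p) ** gamma, so hard examples dominate the gradient.
    """
    # Cosine similarities between every image/text pair in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature

    # Probability assigned to the correct match in each direction:
    # rows = image-to-text softmax, columns = text-to-image softmax.
    p_i2t = logits.softmax(dim=1).diagonal()
    p_t2i = logits.softmax(dim=0).diagonal()

    # Focal modulation of the cross-entropy terms.
    loss_i2t = ((1 - p_i2t) ** gamma * -p_i2t.log()).mean()
    loss_t2i = ((1 - p_t2i) ** gamma * -p_t2i.log()).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With `gamma = 0` this reduces to the usual symmetric contrastive loss; increasing `gamma` shifts the training signal toward hard, poorly matched pairs, consistent with the abstract's stated motivation.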

Cite

Text

Park and Han. "Multi-Modal Representation Learning with Text-Driven Soft Masks." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00274

Markdown

[Park and Han. "Multi-Modal Representation Learning with Text-Driven Soft Masks." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/park2023cvpr-multimodal/) doi:10.1109/CVPR52729.2023.00274

BibTeX

@inproceedings{park2023cvpr-multimodal,
  title     = {{Multi-Modal Representation Learning with Text-Driven Soft Masks}},
  author    = {Park, Jaeyoo and Han, Bohyung},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {2798--2807},
  doi       = {10.1109/CVPR52729.2023.00274},
  url       = {https://mlanthology.org/cvpr/2023/park2023cvpr-multimodal/}
}