SyncMask: Synchronized Attentional Masking for Fashion-Centric Vision-Language Pretraining
Abstract
Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in the fashion domain, datasets often exhibit a disparity between the information conveyed in image and text. This issue stems from datasets containing multiple images of a single fashion item, all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. Addressing this problem, we propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false negative issues in Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, outperforming existing methods in three downstream tasks.
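The idea of masking only tokens and patches whose information co-occurs can be illustrated with a toy sketch. This is not the paper's implementation: the function name `synchronized_masks`, the scoring rules (peak attention per token, summed attention per patch), and the hand-written attention matrix are all assumptions chosen for illustration. In the paper, the cross-attentional features come from a momentum model.

```python
from typing import List, Tuple


def synchronized_masks(
    cross_attn: List[List[float]],  # assumed layout: rows = word tokens, cols = image patches
    num_text_masks: int,
    num_image_masks: int,
) -> Tuple[List[int], List[int]]:
    """Pick token and patch indices that attend strongly to each other (illustrative only)."""
    # Score each token by its strongest attention to any patch; a high peak
    # suggests the word is actually visible somewhere in this image.
    token_scores = [max(row) for row in cross_attn]
    token_order = sorted(range(len(cross_attn)), key=lambda i: token_scores[i], reverse=True)
    masked_tokens = sorted(token_order[:num_text_masks])

    # Score each patch by the attention it receives from the chosen tokens only,
    # so the image mask is tied to the text mask.
    num_patches = len(cross_attn[0])
    patch_scores = [sum(cross_attn[t][p] for t in masked_tokens) for p in range(num_patches)]
    patch_order = sorted(range(num_patches), key=lambda p: patch_scores[p], reverse=True)
    masked_patches = sorted(patch_order[:num_image_masks])
    return masked_tokens, masked_patches


# Toy input: 3 tokens x 4 patches. Token 1 attends uniformly, standing in for
# a textual detail that is not visible in this particular image.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.10, 0.80],
]
tokens, patches = synchronized_masks(attn, num_text_masks=2, num_image_masks=2)
print(tokens, patches)
```

In this toy input, the diffuse token 1 is left unmasked, and the masked patches are the ones the selected tokens attend to most.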
Cite
Text
Song et al. "SyncMask: Synchronized Attentional Masking for Fashion-Centric Vision-Language Pretraining." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01323
Markdown
[Song et al. "SyncMask: Synchronized Attentional Masking for Fashion-Centric Vision-Language Pretraining." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/song2024cvpr-syncmask/) doi:10.1109/CVPR52733.2024.01323
BibTeX
@inproceedings{song2024cvpr-syncmask,
title = {{SyncMask: Synchronized Attentional Masking for Fashion-Centric Vision-Language Pretraining}},
author = {Song, Chull Hwan and Hwang, Taebaek and Yoon, Jooyoung and Choi, Shunghyun and Gu, Yeong Hyeon},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13948-13957},
doi = {10.1109/CVPR52733.2024.01323},
url = {https://mlanthology.org/cvpr/2024/song2024cvpr-syncmask/}
}