TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-Training

Abstract

Self-supervised Multi-modal Contrastive Learning (SMCL) has remarkably advanced modern Vision-Language Pre-training (VLP) models by aligning the visual and linguistic modalities. However, due to the noise in web-harvested image-text pairs, scaling up the training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance gains without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. Experimental results demonstrate that TiMix achieves performance comparable to existing methods on downstream tasks, even with less training data and shorter training time. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader adoption of VLP models in practical scenarios. Our code is available at https://github.com/chaoyajiang/TiMiX/tree/main.

Cite

Text

Jiang et al. "TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-Training." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I3.28025

Markdown

[Jiang et al. "TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-Training." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/jiang2024aaai-timix/) doi:10.1609/AAAI.V38I3.28025

BibTeX

@inproceedings{jiang2024aaai-timix,
  title     = {{TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-Training}},
  author    = {Jiang, Chaoya and Ye, Wei and Xu, Haiyang and Ye, Qinghao and Yan, Ming and Zhang, Ji and Zhang, Shikun},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {2489--2497},
  doi       = {10.1609/AAAI.V38I3.28025},
  url       = {https://mlanthology.org/aaai/2024/jiang2024aaai-timix/}
}