Training Vision-Language Transformers from Captions

Abstract

Vision-Language Transformers can be learned without low-level human labels (e.g., class labels or bounding boxes). Existing work, whether explicitly utilizing bounding boxes (Chen et al., 2020b; Tan & Bansal, 2019; Lu et al., 2019) or patches (Kim et al., 2021), assumes that the visual backbone must first be trained on ImageNet (Russakovsky et al., 2015) class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce Vision-Language from Captions (VLC), a new model built on top of Masked Auto-Encoders (He et al., 2022) that does not require this supervision. We seek to provide general advice on multimodal pretraining by examining the roles of (a) unimodal initialization, (b) unimodal architectural components, and (c) data annotation in the pretraining corpus. Our extensive and carefully controlled studies suggest that none of these factors is indispensable for learning versatile vision-language representations. We conclude with recommendations on initialization, architectural components, and annotation formats that better balance data efficiency and representation quality.
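
To make the setup concrete, below is a minimal PyTorch sketch of how a caption-supervised vision-language transformer of this kind could be wired: raw image patches and caption tokens feed a single shared encoder, with no object detector and no ImageNet-pretrained visual backbone. The class name, dimensions, and the toy masked-language-modeling head are illustrative assumptions, not the authors' released implementation.

# Minimal sketch (not the authors' code): one shared transformer over ViT-style
# image patches and caption tokens. All names and hyperparameters are placeholders.
import torch
import torch.nn as nn

class VisionLanguageFromCaptions(nn.Module):
    def __init__(self, vocab_size=30522, img_size=224, patch_size=16,
                 dim=768, depth=12, heads=12, max_text_len=40):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Linear patch projection, as in ViT / MAE: no CNN, no region proposals.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.patch_pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Caption tokens share the same transformer as the image patches.
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.text_pos = nn.Parameter(torch.zeros(1, max_text_len, dim))
        self.type_embed = nn.Embedding(2, dim)  # 0 = image, 1 = text
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mlm_head = nn.Linear(dim, vocab_size)  # toy caption-prediction head

    def forward(self, images, token_ids):
        # images: (B, 3, H, W); token_ids: (B, T)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        patches = patches + self.patch_pos + self.type_embed.weight[0]
        text = self.tok_embed(token_ids) + self.text_pos[:, :token_ids.size(1)]
        text = text + self.type_embed.weight[1]
        fused = self.encoder(torch.cat([patches, text], dim=1))
        # Return vocabulary logits for the text positions (e.g., for MLM on captions).
        return self.mlm_head(fused[:, patches.size(1):])

# Toy forward pass
model = VisionLanguageFromCaptions()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 40)))
print(logits.shape)  # torch.Size([2, 40, 30522])

The point of the sketch is only the data flow the abstract describes: image-text supervision comes from captions alone, so nothing in the visual pathway depends on ImageNet class labels or detector outputs.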

Cite

Text

Gui et al. "Training Vision-Language Transformers from Captions." Transactions on Machine Learning Research, 2023.

Markdown

[Gui et al. "Training Vision-Language Transformers from Captions." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/gui2023tmlr-training/)

BibTeX

@article{gui2023tmlr-training,
  title     = {{Training Vision-Language Transformers from Captions}},
  author    = {Gui, Liangke and Chang, Yingshan and Huang, Qiuyuan and Som, Subhojit and Hauptmann, Alexander G and Gao, Jianfeng and Bisk, Yonatan},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023},
  url       = {https://mlanthology.org/tmlr/2023/gui2023tmlr-training/}
}