Dynamic Pretraining of Vision-Language Models

Abstract

Vision-Language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. While most prior work has taken the direction of scaling training to increasingly large models and datasets, in this paper we propose a dynamic pretraining resampling approach that utilizes a variety of pretraining tasks and results in more sample-efficient models. We show that a set of diverse self- and weakly-supervised pretraining tasks, dynamically sampled according to task difficulty, provides strong performance. A single 330M-parameter pretrained model, trained using only smaller, publicly accessible image-language datasets, achieves competitive or SOTA performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering.
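
A minimal sketch of what difficulty-weighted task resampling could look like in practice; the task names, difficulty scores, and temperature parameter below are illustrative assumptions, not the paper's exact procedure:

```python
import random

# Hypothetical pretraining tasks with running difficulty estimates
# (e.g., recent per-task loss); the values here are purely illustrative.
task_difficulty = {
    "image_captioning": 1.2,
    "masked_language_modeling": 0.8,
    "image_text_matching": 0.5,
}

def sample_task(difficulties, temperature=1.0):
    """Sample a pretraining task with probability proportional to its
    current difficulty, so harder tasks are visited more often."""
    names = list(difficulties)
    weights = [d ** (1.0 / temperature) for d in difficulties.values()]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(names, weights=probs, k=1)[0]

# At each pretraining step, pick the next task dynamically,
# then refresh its difficulty estimate from the observed loss.
next_task = sample_task(task_difficulty)
```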

Cite

Text

Piergiovanni et al. "Dynamic Pretraining of Vision-Language Models." ICLR 2023 Workshops: MRL, 2023.

Markdown

[Piergiovanni et al. "Dynamic Pretraining of Vision-Language Models." ICLR 2023 Workshops: MRL, 2023.](https://mlanthology.org/iclrw/2023/piergiovanni2023iclrw-dynamic/)

BibTeX

@inproceedings{piergiovanni2023iclrw-dynamic,
  title     = {{Dynamic Pretraining of Vision-Language Models}},
  author    = {Piergiovanni, AJ and Kuo, Weicheng and Li, Wei and Angelova, Anelia},
  booktitle = {ICLR 2023 Workshops: MRL},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/piergiovanni2023iclrw-dynamic/}
}