Dynamic Pretraining of Vision-Language Models
Abstract
Vision-Language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. While most work has scaled training to increasingly large models and datasets, in this paper we propose a dynamic pretraining resampling approach that uses a variety of pretraining tasks and yields more sample-efficient models. We show that a set of diverse self- and weakly-supervised pretraining tasks, dynamically sampled according to task difficulty, provides strong performance. A single 330M-parameter pretrained model, using only smaller, publicly accessible image-language datasets, achieves competitive or SOTA performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering.
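The abstract does not specify how task difficulty is measured or how it is turned into sampling rates. As a rough illustration only, the sketch below assumes that a task's recent training loss stands in for its difficulty and that a softmax over those losses gives the sampling probabilities. The task names, the loss values, the `temperature` parameter, and both function names are hypothetical and are not taken from the paper.

```python
import math
import random


def task_sampling_probs(recent_losses, temperature=1.0):
    """Turn per-task losses into sampling probabilities with a softmax.

    Assumption: a higher recent loss means a harder task, so it is
    sampled more often. This is an illustrative choice, not the
    paper's stated method.
    """
    names = list(recent_losses)
    scaled = [recent_losses[name] / temperature for name in names]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


def sample_task(probs, rng):
    """Draw one task name according to the given probabilities."""
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical tasks and loss values, for illustration only.
losses = {"captioning": 2.0, "masked_lm": 1.0, "matching": 0.5}
probs = task_sampling_probs(losses)
rng = random.Random(0)
next_task = sample_task(probs, rng)
```

In a training loop, the losses would be refreshed periodically (for example with a running average) and the probabilities recomputed, so the task mix shifts as some tasks become easier. A lower `temperature` concentrates sampling on the hardest tasks, while a higher one approaches uniform sampling.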
Cite
Text
Piergiovanni et al. "Dynamic Pretraining of Vision-Language Models." ICLR 2023 Workshops: MRL, 2023.
Markdown
[Piergiovanni et al. "Dynamic Pretraining of Vision-Language Models." ICLR 2023 Workshops: MRL, 2023.](https://mlanthology.org/iclrw/2023/piergiovanni2023iclrw-dynamic/)
BibTeX
@inproceedings{piergiovanni2023iclrw-dynamic,
title = {{Dynamic Pretraining of Vision-Language Models}},
author = {Piergiovanni, AJ and Kuo, Weicheng and Li, Wei and Angelova, Anelia},
booktitle = {ICLR 2023 Workshops: MRL},
year = {2023},
url = {https://mlanthology.org/iclrw/2023/piergiovanni2023iclrw-dynamic/}
}