Should VLMs Be Pre-Trained with Image Data?
Abstract
Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs that integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amounts of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. Averaged over 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.
Cite
Text
Keh et al. "Should VLMs Be Pre-Trained with Image Data?" International Conference on Learning Representations, 2025.
Markdown
[Keh et al. "Should VLMs Be Pre-Trained with Image Data?" International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/keh2025iclr-vlms/)
BibTeX
@inproceedings{keh2025iclr-vlms,
title = {{Should VLMs Be Pre-Trained with Image Data?}},
author = {Keh, Sedrick and Mercat, Jean and Gadre, Samir Yitzhak and Arora, Kushal and Vasiljevic, Igor and Burchfiel, Benjamin and Song, Shuran and Tedrake, Russ and Kollar, Thomas and Schmidt, Ludwig and Dave, Achal},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/keh2025iclr-vlms/}
}