Aligning Vision Language Models with Contrastive Learning
Abstract
In recent years, Vision Language Models (VLMs) have achieved significant advancements due to the success of large language models. The common strategy for aligning vision and language models involves a two-step process: an alignment (or pretraining) stage and an instruction tuning stage. During the alignment stage, a projection module is trained to map image embeddings into the language space using a paired image-text dataset. In the instruction tuning stage, the model is trained to answer specific questions about the images. In this work, we focus on the alignment stage and identify a significant gap between the embeddings for image and text pairs when VLMs are trained with next-token prediction loss. To address this issue, we employ a contrastive training strategy similar to that used by Radford et al. [38] along with next-token-prediction-based training. Our findings indicate that this joint pretraining method enhances VLM performance by approximately 2% across various multimodal evaluations without any additional compute or training data. To assess the robustness and generalizability of joint training, we experimented with multiple large language models and observed similar performance improvements. Furthermore, we explore the importance of prompts in contrastive training with various LLM options. We also provide a detailed analysis of the type of vision encoder, projection layer, and LLM to use with the proposed joint training approach.
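As a rough illustration of the joint objective described in the abstract, the sketch below adds a CLIP-style symmetric contrastive loss over paired image and text embeddings to a next-token cross-entropy loss. It is not the authors' implementation: the function names, the `weight` parameter, the temperature of 0.07, and the simple additive combination are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row k of each input is assumed to be a matched pair.

    The temperature value is an assumption, not taken from the paper.
    """
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    n = len(logits)
    # Image-to-text direction: each row should pick out its own column.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def next_token_loss(token_logits, target_ids):
    """Mean cross-entropy of the target token at each position."""
    total = sum(log_softmax(l)[t] for l, t in zip(token_logits, target_ids))
    return -total / len(target_ids)


def joint_loss(image_embs, text_embs, token_logits, target_ids, weight=1.0):
    """Hypothetical combination: next-token loss plus a weighted contrastive term."""
    return next_token_loss(token_logits, target_ids) + weight * contrastive_loss(
        image_embs, text_embs
    )
```

In this sketch, a batch whose image and text embeddings are already aligned contributes almost nothing through the contrastive term, while a misaligned batch is penalized heavily, which is the pressure that would act on the embedding gap the abstract mentions.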
Cite
Text
Ak et al. "Aligning Vision Language Models with Contrastive Learning." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91672-4_3
Markdown
[Ak et al. "Aligning Vision Language Models with Contrastive Learning." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/ak2024eccvw-aligning/) doi:10.1007/978-3-031-91672-4_3
BibTeX
@inproceedings{ak2024eccvw-aligning,
title = {{Aligning Vision Language Models with Contrastive Learning}},
author = {Ak, Kenan E. and Mohta, Jay and Dimitriadis, Dimitris and Manchanda, Saurav and Xu, Yan and Shen, Mingwei},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {32--45},
doi = {10.1007/978-3-031-91672-4_3},
url = {https://mlanthology.org/eccvw/2024/ak2024eccvw-aligning/}
}