VladVA: Discriminative Fine-Tuning of LVLMs

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, with gains on standard image-text retrieval benchmarks and notable improvements in compositionality.
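To make the dual objective in contribution (1) concrete, here is a minimal NumPy sketch of the two loss families the abstract names: a symmetric contrastive (InfoNCE-style) loss over matched image-text embedding pairs, combined with a standard next-token cross-entropy loss. The function names, shapes, and the simple weighted sum at the end are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def logsumexp(x, axis=-1, keepdims=False):
    # numerically stable log-sum-exp
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    idx = np.arange(len(logits))                 # matched pairs lie on the diagonal
    li2t = logits[idx, idx] - logsumexp(logits, axis=1)        # image -> text
    lt2i = logits.T[idx, idx] - logsumexp(logits.T, axis=1)    # text -> image
    return -0.5 * (li2t.mean() + lt2i.mean())

def next_token_loss(token_logits, targets):
    """Autoregressive cross-entropy over the caption's token positions."""
    log_probs = token_logits - logsumexp(token_logits, axis=-1, keepdims=True)
    return -log_probs[np.arange(len(targets)), targets].mean()

def total_loss(img_emb, txt_emb, token_logits, targets, lam=1.0):
    # hypothetical weighting; the paper's actual balancing may differ
    return contrastive_loss(img_emb, txt_emb) + lam * next_token_loss(token_logits, targets)
```

In a CLIP-like setup the contrastive term pulls matched image and text embeddings together across the batch, while the next-token term preserves the LVLM's generative language modeling signal; the abstract's ablations motivate using both together.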

Cite

Text

Ouali et al. "VladVA: Discriminative Fine-Tuning of LVLMs." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00388

Markdown

[Ouali et al. "VladVA: Discriminative Fine-Tuning of LVLMs." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/ouali2025cvpr-vladva/) doi:10.1109/CVPR52734.2025.00388

BibTeX

@inproceedings{ouali2025cvpr-vladva,
  title     = {{VladVA: Discriminative Fine-Tuning of LVLMs}},
  author    = {Ouali, Yassine and Bulat, Adrian and Xenos, Alexandros and Zaganidis, Anestis and Metaxas, Ioannis Maniadis and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {4101--4111},
  doi       = {10.1109/CVPR52734.2025.00388},
  url       = {https://mlanthology.org/cvpr/2025/ouali2025cvpr-vladva/}
}