Probing Visual Language Priors in VLMs
Abstract
Vision-Language Models (VLMs) may over-rely on visual language priors from their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4o achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data and then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our proposed training objective, Image-DPO, compels VLMs to focus more on the actual visual inputs, and we demonstrate its effectiveness on LLaVA-v1.5 and Cambrian. Project Page: https://vilp-team.github.io/.
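The abstract describes Image-DPO only at a high level; the sketch below illustrates one plausible reading of a DPO-style objective over "good-bad" image pairs, where the same answer is scored conditioned on the clean versus the corrupted image. The function name, tensor layout, and beta default are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def image_dpo_loss(policy_logp_good, policy_logp_bad,
                   ref_logp_good, ref_logp_bad, beta=0.1):
    """DPO-style loss over image pairs (hypothetical sketch).

    Each tensor holds the summed log-probability of the same answer text,
    conditioned on the same question but on either the original ("good")
    image or its corrupted ("bad") counterpart, under the trainable policy
    or a frozen reference model. All shapes: (batch,).
    """
    # Log-likelihood ratios of policy vs. reference for each image variant.
    good_ratio = policy_logp_good - ref_logp_good
    bad_ratio = policy_logp_bad - ref_logp_bad

    # Encourage the model to prefer answers grounded in the clean image
    # over the same answers paired with the corrupted image.
    logits = beta * (good_ratio - bad_ratio)
    return -F.logsigmoid(logits).mean()
```

Because the preferred and rejected samples differ only in the image, the gradient signal pushes the model to attend to visual evidence rather than text priors, which matches the stated goal of the objective.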
Cite
Text
Luo et al. "Probing Visual Language Priors in VLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Luo et al. "Probing Visual Language Priors in VLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/luo2025icml-probing/)
BibTeX
@inproceedings{luo2025icml-probing,
  title = {{Probing Visual Language Priors in VLMs}},
  author = {Luo, Tiange and Cao, Ang and Lee, Gunhee and Johnson, Justin and Lee, Honglak},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year = {2025},
  pages = {41120--41156},
  volume = {267},
  url = {https://mlanthology.org/icml/2025/luo2025icml-probing/}
}