What's in the Image? A Deep-Dive into the Vision of Vision Language Models

Abstract

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on the attention modules across layers, and reveal several key insights about how these models process visual data: (i) The internal representation of the query tokens (e.g., representations of "describe the image") is utilized by the model to store global image information; we demonstrate that the model generates surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are directly extracted from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose a novel quantitative evaluation to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.
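Finding (i) describes generation that proceeds without direct access to image tokens. One way such a restriction could be expressed is as an attention mask, sketched below. This is an illustrative assumption, not the authors' implementation: the token layout (image tokens, then query tokens, then generated tokens), the function name `knockout_mask`, and its parameters are all hypothetical, and the abstract does not specify how the restriction is applied in practice.

```python
from typing import List


def knockout_mask(n_image: int, n_query: int, n_generated: int) -> List[List[bool]]:
    """Build a boolean attention mask; mask[i][j] is True when position i may attend to j.

    Assumed (hypothetical) sequence layout:
        [image tokens | query tokens | generated tokens]
    """
    total = n_image + n_query + n_generated
    gen_start = n_image + n_query  # index of the first generated token

    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            # Standard causal attention: a position sees itself and earlier positions.
            allowed = j <= i
            # Knockout: generated tokens may not attend directly to image tokens,
            # so any image information must reach them through the query tokens.
            if i >= gen_start and j < n_image:
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


# Example: 4 image tokens, 3 query tokens, 2 generated tokens.
mask = knockout_mask(n_image=4, n_query=3, n_generated=2)
```

Under this layout, the query tokens still attend to the image tokens, while the generated tokens see only the query tokens and earlier generated tokens. A mask of this shape would typically be converted to additive form (zero where allowed, negative infinity where blocked) before being applied to attention scores.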

Cite

Text

Kaduri et al. "What's in the Image? A Deep-Dive into the Vision of Vision Language Models." Conference on Computer Vision and Pattern Recognition, 2025.

Markdown

[Kaduri et al. "What's in the Image? A Deep-Dive into the Vision of Vision Language Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/kaduri2025cvpr-image/)

BibTeX

@inproceedings{kaduri2025cvpr-image,
  title     = {{What's in the Image? A Deep-Dive into the Vision of Vision Language Models}},
  author    = {Kaduri, Omri and Bagon, Shai and Dekel, Tali},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14549-14558},
  url       = {https://mlanthology.org/cvpr/2025/kaduri2025cvpr-image/}
}