Visual Data-Type Understanding Does Not Emerge from Scaling Vision-Language Models

Abstract

Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding.
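The zero-shot evaluation described above casts data-type identification as prompt-based classification with a contrastive VLM. Below is a minimal sketch (not the authors' released code) of how such an evaluation could look for a CLIP model, assuming the Hugging Face `transformers` library; the prompt strings and the input image path are illustrative placeholders, not the paper's exact prompt set of 27 data-types.

```python
# Illustrative sketch: zero-shot data-type identification with CLIP.
# Assumes `transformers`, `torch`, and `Pillow` are installed; prompts are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical data-type prompts; the paper uses 27 data-types across four categories.
data_type_prompts = [
    "a photo of an animal",
    "a cartoon of an animal",
    "a pencil sketch of an animal",
    "a rotated photo of an animal",
    "a photo of an animal with added noise",
]

image = Image.open("example_animal.jpg")  # placeholder input image
inputs = processor(text=data_type_prompts, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # logits_per_image holds the image-to-prompt similarity scores.
    probs = outputs.logits_per_image.softmax(dim=-1)

predicted = data_type_prompts[probs.argmax().item()]
print(f"Predicted data-type prompt: {predicted}")
```

Accuracy is then computed by checking whether the highest-scoring prompt matches the ground-truth data-type of each altered image.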

Cite

Text

Udandarao et al. "Visual Data-Type Understanding Does Not Emerge from Scaling Vision-Language Models." International Conference on Learning Representations, 2024.

Markdown

[Udandarao et al. "Visual Data-Type Understanding Does Not Emerge from Scaling Vision-Language Models." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/udandarao2024iclr-visual/)

BibTeX

@inproceedings{udandarao2024iclr-visual,
  title     = {{Visual Data-Type Understanding Does Not Emerge from Scaling Vision-Language Models}},
  author    = {Udandarao, Vishaal and Burg, Max F and Albanie, Samuel and Bethge, Matthias},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/udandarao2024iclr-visual/}
}