Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Abstract

Pre-trained vision-language models (VLMs) have achieved high performance on various downstream tasks and have been widely used for visual grounding in a weakly supervised manner. However, despite the performance gains contributed by large vision and language pre-training, we find that state-of-the-art VLMs struggle with compositional reasoning on grounding tasks. To demonstrate this, we propose the Attribute, Relation, and Priority grounding (ARPGrounding) benchmark to test VLMs' compositional reasoning ability on visual grounding tasks. ARPGrounding contains 11,425 samples and evaluates the compositional understanding of VLMs along three dimensions: 1) attribute, denoting comprehension of objects' properties; 2) relation, indicating an understanding of relations between objects; 3) priority, reflecting an awareness of the part of speech associated with nouns. Using the ARPGrounding benchmark, we evaluate several mainstream VLMs. We empirically find that these models perform quite well on conventional visual grounding datasets, achieving performance comparable to or surpassing state-of-the-art methods, but show strong deficiencies in compositional reasoning. Furthermore, we propose a composition-aware fine-tuning pipeline, demonstrating the potential of leveraging cost-effective image-text annotations to enhance the compositional understanding of VLMs in grounding tasks.
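To make the weakly supervised grounding setting concrete, below is a minimal, hedged sketch of how a pre-trained VLM such as CLIP can be scored on grounding: each candidate region is ranked by image-text similarity against the referring expression, and a prediction counts as correct if its IoU with the ground-truth box exceeds a threshold (0.5 here). This is an illustrative evaluation recipe, not the paper's pipeline; the candidate boxes, backbone choice, and threshold are assumptions.

```python
"""Hedged sketch: weakly supervised grounding evaluation with CLIP.
Not the paper's method; boxes, backbone, and the 0.5 IoU threshold
are illustrative assumptions."""
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

@torch.no_grad()
def ground(image: Image.Image, expression: str, boxes):
    # Score every candidate crop against the expression and return
    # the highest-scoring box (argmax over image-text similarity).
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    text = clip.tokenize([expression]).to(device)
    img_f = model.encode_image(crops)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    scores = (img_f @ txt_f.T).squeeze(-1)
    return boxes[scores.argmax().item()]

# Usage (hypothetical sample): a prediction is counted correct if
# IoU(prediction, ground truth) >= 0.5.
# image = Image.open("example.jpg")
# boxes = [(10, 20, 120, 200), (150, 30, 300, 210)]
# pred = ground(image, "the man in a red shirt holding a cup", boxes)
# correct = iou(pred, gt_box) >= 0.5
```

A compositional benchmark of this kind then pairs each expression with contrastive alternatives (e.g., swapped attributes or reversed relations), so a model that ignores composition can no longer succeed by object category alone.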

Cite

Text

Zeng et al. "Investigating Compositional Challenges in Vision-Language Models for Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01341

Markdown

[Zeng et al. "Investigating Compositional Challenges in Vision-Language Models for Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zeng2024cvpr-investigating/) doi:10.1109/CVPR52733.2024.01341

BibTeX

@inproceedings{zeng2024cvpr-investigating,
  title     = {{Investigating Compositional Challenges in Vision-Language Models for Visual Grounding}},
  author    = {Zeng, Yunan and Huang, Yan and Zhang, Jinjin and Jie, Zequn and Chai, Zhenhua and Wang, Liang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14141--14151},
  doi       = {10.1109/CVPR52733.2024.01341},
  url       = {https://mlanthology.org/cvpr/2024/zeng2024cvpr-investigating/}
}