Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Abstract

Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at a finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.

Cite

Text

Peng et al. "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01261

Markdown

[Peng et al. "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/peng2024cvpr-synthesize/) doi:10.1109/CVPR52733.2024.01261

BibTeX

@inproceedings{peng2024cvpr-synthesize,
  title     = {{Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding}},
  author    = {Peng, Wujian and Xie, Sicheng and You, Zuyao and Lan, Shiyi and Wu, Zuxuan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13279-13288},
  doi       = {10.1109/CVPR52733.2024.01261},
  url       = {https://mlanthology.org/cvpr/2024/peng2024cvpr-synthesize/}
}