Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Xu, Zhenlin; Zhu, Yi; Deng, Siqi; Mittal, Abhay; Chen, Yanbei; Wang, Manchen; Favaro, Paolo; Tighe, Joseph; Modolo, Davide

doi:10.1109/CVPRW63382.2024.00189

CVPRW 2024 pp. 1827-1836

Abstract

This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs’ consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvements, it doesn’t fully address these issues, highlighting the need for VLMs with enhanced generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.

Cite

Text

Xu et al. "Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00189

Markdown

[Xu et al. "Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/xu2024cvprw-benchmarking/) doi:10.1109/CVPRW63382.2024.00189

BibTeX

@inproceedings{xu2024cvprw-benchmarking,
  title     = {{Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity}},
  author    = {Xu, Zhenlin and Zhu, Yi and Deng, Siqi and Mittal, Abhay and Chen, Yanbei and Wang, Manchen and Favaro, Paolo and Tighe, Joseph and Modolo, Davide},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {1827--1836},
  doi       = {10.1109/CVPRW63382.2024.00189},
  url       = {https://mlanthology.org/cvprw/2024/xu2024cvprw-benchmarking/}
}