WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Abstract

Cross-modal retrieval tasks, such as image-to-text and text-to-image retrieval, are crucial for evaluating vision-language models (VLMs). State-of-the-art VLMs like CLIP and BLIP-2 achieve impressive performance on benchmarks such as MSCOCO and Flickr30K. However, due to the high similarity between evaluation datasets (e.g., Flickr30K) and fine-tuning datasets (e.g., MSCOCO), these benchmarks are insufficient for assessing the out-of-distribution (OOD) generalization capabilities of VLMs. We introduce WikiDO (derived from Wikipedia Diversity Observatory), a new benchmark featuring 384K image-text pairs, alongside carefully curated, human-verified in-distribution (ID) and OOD test sets of 3K pairs each. Our evaluations show that BLIP-2 achieves a zero-shot recall@1 (R@1) of 66% on WikiDO's OOD test set, compared to 81% on MSCOCO and 95% on Flickr30K. Fine-tuning on WikiDO yields modest improvements, further demonstrating the benchmark's utility in testing OOD generalization. Our code and benchmark datasets will be released publicly.
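
For context, the R@1 numbers above follow the standard cross-modal retrieval protocol: embed images and captions with the VLM, rank candidates by cosine similarity, and count how often the true match is ranked first. The snippet below is a minimal sketch of that metric, not the authors' evaluation code; it assumes pre-extracted, L2-normalized embeddings (img_emb, txt_emb) with a one-to-one image-caption alignment, which simplifies benchmarks such as MSCOCO where an image can have several reference captions. The load_features helper is hypothetical.

import numpy as np

def recall_at_1(query_emb: np.ndarray, gallery_emb: np.ndarray) -> float:
    """Fraction of queries whose true match (the same row index) ranks first.

    Both inputs are (N, d) arrays of L2-normalized embeddings, so the dot
    product equals cosine similarity.
    """
    sims = query_emb @ gallery_emb.T          # (N, N) similarity matrix
    top1 = sims.argmax(axis=1)                # best gallery index per query
    return float((top1 == np.arange(len(sims))).mean())

# Hypothetical usage with pre-extracted CLIP / BLIP-2 features:
# img_emb, txt_emb = load_features(...)       # each (N, d), rows aligned
# print("image-to-text R@1:", recall_at_1(img_emb, txt_emb))
# print("text-to-image R@1:", recall_at_1(txt_emb, img_emb))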

Cite

Text

Tankala et al. "WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models." NeurIPS 2024 Workshops: RBFM, 2024.

Markdown

[Tankala et al. "WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models." NeurIPS 2024 Workshops: RBFM, 2024.](https://mlanthology.org/neuripsw/2024/tankala2024neuripsw-wikido/)

BibTeX

@inproceedings{tankala2024neuripsw-wikido,
  title     = {{WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models}},
  author    = {Tankala, Pavan Kalyan and Pasi, Piyush Singh and Dharod, Sahil and Motiwala, Azeem and Jyothi, Preethi and Chaudhary, Aditi and Srinivasan, Krishna},
  booktitle = {NeurIPS 2024 Workshops: RBFM},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/tankala2024neuripsw-wikido/}
}