WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Abstract

Cross-modal retrieval tasks, such as image-to-text and text-to-image retrieval, are crucial for evaluating vision-language models (VLMs). State-of-the-art VLMs like CLIP and BLIP-2 achieve impressive performance on benchmarks such as MSCOCO and Flickr30K. However, due to the high similarity between evaluation datasets (e.g., Flickr30K) and fine-tuning datasets (e.g., MSCOCO), these benchmarks are insufficient for assessing the out-of-distribution (OOD) generalization capabilities of VLMs. We introduce WikiDO (derived from Wikipedia Diversity Observatory), a new benchmark featuring 384K image-text pairs, alongside carefully curated, human-verified in-distribution (ID) and OOD test sets of 3K pairs each. Our evaluations show that BLIP-2 achieves a zero-shot recall@1 (R@1) of 66% on WikiDO's OOD test set, compared to 81% on MSCOCO and 95% on Flickr30K. Fine-tuning on WikiDO yields modest improvements, further demonstrating the benchmark's utility in testing OOD generalization. Our code and benchmark datasets will be released publicly.
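
For context, the R@1 numbers above follow the standard cross-modal retrieval protocol: embed images and captions with the VLM, rank candidates by cosine similarity, and count how often the true match is ranked first. The snippet below is a minimal sketch of that metric, not the authors' evaluation code; it assumes pre-extracted, L2-normalized embeddings (img_emb, txt_emb) with a one-to-one image-caption alignment, which simplifies benchmarks such as MSCOCO where an image can have several reference captions. The load_features helper is hypothetical.

import numpy as np

def recall_at_1(query_emb: np.ndarray, gallery_emb: np.ndarray) -> float:
    """Fraction of queries whose true match (the same row index) ranks first.

    Both inputs are (N, d) arrays of L2-normalized embeddings, so the dot
    product equals cosine similarity.
    """
    sims = query_emb @ gallery_emb.T          # (N, N) similarity matrix
    top1 = sims.argmax(axis=1)                # best gallery index per query
    return float((top1 == np.arange(len(sims))).mean())

# Hypothetical usage with pre-extracted CLIP / BLIP-2 features:
# img_emb, txt_emb = load_features(...)       # each (N, d), rows aligned
# print("image-to-text R@1:", recall_at_1(img_emb, txt_emb))
# print("text-to-image R@1:", recall_at_1(txt_emb, img_emb))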

Cite

Text

Tankala et al. "WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models." NeurIPS 2024 Workshops: RBFM, 2024.

Markdown

[Tankala et al. "WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models." NeurIPS 2024 Workshops: RBFM, 2024.](https://mlanthology.org/neuripsw/2024/tankala2024neuripsw-wikido/)

BibTeX

@inproceedings{tankala2024neuripsw-wikido,
  title     = {{WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models}},
  author    = {Tankala, Pavan Kalyan and Pasi, Piyush Singh and Dharod, Sahil and Motiwala, Azeem and Jyothi, Preethi and Chaudhary, Aditi and Srinivasan, Krishna},
  booktitle = {NeurIPS 2024 Workshops: RBFM},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/tankala2024neuripsw-wikido/}
}