Grounded Image Text Matching with Mismatched Relation Reasoning

Abstract

This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine whether an expression describes an image, and then either localize the referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating vision-language (VL) models on this task, focusing on the challenging settings of limited training data and out-of-distribution sentence lengths. Our evaluation shows that pre-trained VL models often lack data efficiency and the ability to generalize across sentence lengths. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in terms of both length generalization and data efficiency. The code and data are available at https://github.com/SHTUPLUS/GITM-MR.
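The core mechanism named in the abstract, bi-directional message propagation over a graph derived from language structure, can be illustrated with a minimal sketch. All names and the graph encoding below (BiDirectionalPropagation, fwd_msg, bwd_msg, the parent/child edge list) are hypothetical assumptions for illustration, not the authors' released implementation; see the repository above for the actual code.

# Minimal, hypothetical sketch of bi-directional message propagation over a
# language-structure graph, in the spirit of the RCRN described above.
import torch
import torch.nn as nn

class BiDirectionalPropagation(nn.Module):
    def __init__(self, dim: int, steps: int = 2):
        super().__init__()
        self.steps = steps
        # Separate message functions for the two directions along each edge.
        self.fwd_msg = nn.Linear(2 * dim, dim)   # parent -> child
        self.bwd_msg = nn.Linear(2 * dim, dim)   # child -> parent
        self.update = nn.GRUCell(dim, dim)       # node-state update

    def forward(self, nodes, edges):
        # nodes: (N, dim) phrase/entity features; edges: (parent, child) index
        # pairs taken from the expression's structure (e.g., a parse tree).
        h = nodes
        for _ in range(self.steps):
            msgs = torch.zeros_like(h)
            for p, c in edges:
                # Messages flow in both directions, so evidence from entities
                # can refine relation nodes and relation context flows back.
                msgs[c] = msgs[c] + self.fwd_msg(torch.cat([h[p], h[c]], -1))
                msgs[p] = msgs[p] + self.bwd_msg(torch.cat([h[c], h[p]], -1))
            h = self.update(msgs, h)
        return h

# Toy usage: 3 nodes ("dog", "left of", "cat") linked through the relation node.
feats = torch.randn(3, 64)
out = BiDirectionalPropagation(64)(feats, edges=[(1, 0), (1, 2)])
print(out.shape)  # torch.Size([3, 64])

Because each propagation step is attached to a node of the language structure, the whole network reads as a modular program over the expression, which is the property the abstract appeals to for length generalization.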

Cite

Text

Wu et al. "Grounded Image Text Matching with Mismatched Relation Reasoning." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00278

Markdown

[Wu et al. "Grounded Image Text Matching with Mismatched Relation Reasoning." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/wu2023iccv-grounded/) doi:10.1109/ICCV51070.2023.00278

BibTeX

@inproceedings{wu2023iccv-grounded,
  title     = {{Grounded Image Text Matching with Mismatched Relation Reasoning}},
  author    = {Wu, Yu and Wei, Yana and Wang, Haozhe and Liu, Yongfei and Yang, Sibei and He, Xuming},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {2976--2987},
  doi       = {10.1109/ICCV51070.2023.00278},
  url       = {https://mlanthology.org/iccv/2023/wu2023iccv-grounded/}
}