Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations

Abstract

Phrase grounding is a multi-modal problem that localizes, in an image, the particular noun phrase referred to by a text query. In the challenging zero-shot phrase grounding setting, existing state-of-the-art grounding models have limited capacity to handle unseen phrases. Humans, however, can ground novel types of objects in images with little effort, benefiting significantly from commonsense reasoning. In this paper, we design a novel phrase grounding architecture that builds multi-modal knowledge graphs from external knowledge and then performs graph reasoning and spatial relation reasoning to localize the referred noun phrases. We perform extensive experiments on different zero-shot grounding splits sub-sampled from the Flickr30K Entities and Visual Genome datasets, demonstrating that the proposed framework is orthogonal to the choice of backbone image encoder and outperforms the baselines by 2-3% in accuracy, a significant improvement under the standard evaluation metrics.
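The paper's implementation is not included on this page; as a rough illustration of the graph-reasoning step the abstract describes, below is a minimal sketch of one round of message passing over a multi-modal knowledge graph in PyTorch. All names here (`GraphReasoner`, `node_feats`, `adj`) are hypothetical and not taken from the paper, and the update rule is a generic graph-convolution-style step, not the authors' method.

```python
# Hypothetical sketch: one graph-reasoning step over a multi-modal
# knowledge graph. Nodes could be image-region / phrase / external-knowledge
# embeddings; edge weights could encode knowledge and spatial relations.
# This is NOT the authors' code, only a generic message-passing illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoner(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, dim) node embeddings; adj: (N, N) nonnegative edge weights.
        # Row-normalize the adjacency so each node averages over its neighbors.
        norm_adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        messages = norm_adj @ self.proj(node_feats)  # aggregate neighbor features
        return F.relu(node_feats + messages)         # residual node update

# Toy usage: 5 nodes with 256-dim features and a random weighted graph.
reasoner = GraphReasoner(256)
feats = torch.randn(5, 256)
adj = torch.rand(5, 5)
updated = reasoner(feats, adj)  # refined node embeddings for downstream grounding
```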

Cite

Text

Shi et al. "Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I2.20123

Markdown

[Shi et al. "Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/shi2022aaai-improving/) doi:10.1609/AAAI.V36I2.20123

BibTeX

@inproceedings{shi2022aaai-improving,
  title     = {{Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations}},
  author    = {Shi, Zhan and Shen, Yilin and Jin, Hongxia and Zhu, Xiaodan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {2253--2261},
  doi       = {10.1609/AAAI.V36I2.20123},
  url       = {https://mlanthology.org/aaai/2022/shi2022aaai-improving/}
}