Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching

Abstract

Image-text matching is central to visual-semantic cross-modal retrieval and has recently attracted extensive attention. Previous studies have been devoted to finding the latent correspondence between image regions and words, e.g., connecting keywords to specific regions of salient objects. However, existing methods are usually committed to handling concrete objects rather than abstract ones, e.g., descriptions of actions, which are in fact also ubiquitous in real-world description texts. The main challenge in dealing with abstract objects is that, unlike their concrete counterparts, there are no explicit connections between them. One therefore has to find the implicit and intrinsic connections between them instead. In this paper, we propose a relation-wise dual attention network (RDAN) for image-text matching. Specifically, we maintain an over-complete set containing pairs of regions and words. Built upon this set, we encode the local correlations and the global dependencies between regions and words by training a visual-semantic network. A dual-pathway attention network is then used to infer the visual-semantic alignments and the image-text similarity. Extensive experiments validate the efficacy of our method, which achieves state-of-the-art performance on several public benchmark datasets.
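To make the dual-pathway idea concrete, the following is a minimal PyTorch sketch of region-word cross attention scored in both directions (regions attending over words, and words attending over regions). All names, dimensions, and the cosine-similarity pooling with average fusion are illustrative assumptions based only on the abstract, not the authors' actual implementation.

import torch
import torch.nn.functional as F

def cross_attention(query, context):
    # Attend each query vector over all context vectors.
    # query:   (n_q, d) features (e.g., image regions)
    # context: (n_c, d) features (e.g., word embeddings)
    # Returns: (n_q, d) context summaries aligned to each query.
    attn = torch.softmax(query @ context.t(), dim=-1)  # (n_q, n_c) relevance weights
    return attn @ context                              # attention-weighted context

def dual_pathway_similarity(regions, words):
    # Dual-pathway score: visual-to-textual and textual-to-visual attention.
    # regions: (n_r, d) region features; words: (n_w, d) word features.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    # Pathway 1: summarize words for each region, score the alignment.
    word_ctx = cross_attention(regions, words)         # (n_r, d)
    s_v = F.cosine_similarity(regions, word_ctx, dim=-1).mean()
    # Pathway 2: summarize regions for each word, score the alignment.
    region_ctx = cross_attention(words, regions)       # (n_w, d)
    s_t = F.cosine_similarity(words, region_ctx, dim=-1).mean()
    return (s_v + s_t) / 2                             # fused image-text score

# Toy usage: 36 regions and 12 words sharing a 1024-d embedding space.
regions = torch.randn(36, 1024)
words = torch.randn(12, 1024)
print(dual_pathway_similarity(regions, words).item())

In retrieval, such a score would be computed for every candidate image-text pair and ranked; averaging the two pathway scores is one simple fusion choice among several plausible ones.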

Cite

Text

Hu et al. "Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching." International Joint Conference on Artificial Intelligence, 2019. doi:10.24963/IJCAI.2019/111

Markdown

[Hu et al. "Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching." International Joint Conference on Artificial Intelligence, 2019.](https://mlanthology.org/ijcai/2019/hu2019ijcai-multi/) doi:10.24963/IJCAI.2019/111

BibTeX

@inproceedings{hu2019ijcai-multi,
  title     = {{Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching}},
  author    = {Hu, Zhibin and Luo, Yongsheng and Lin, Jiong and Yan, Yan and Chen, Jian},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2019},
  pages     = {789--795},
  doi       = {10.24963/IJCAI.2019/111},
  url       = {https://mlanthology.org/ijcai/2019/hu2019ijcai-multi/}
}