Action-Aware Embedding Enhancement for Image-Text Retrieval

Abstract

Image-text retrieval plays a central role in bridging vision and language, aiming to reduce the semantic discrepancy between images and texts. Most existing works rely on refined word and object representations, obtained through data-oriented methods, to capture word-object co-occurrence. Such approaches tend to ignore the asymmetric action relation between images and texts: the text carries an explicit action representation (i.e., a verb phrase), whereas the image contains only implicit action information. In this paper, we propose the Action-aware Memory-Enhanced embedding (AME) method for image-text retrieval, which emphasizes action information when mapping images and texts into a shared embedding space. Specifically, we integrate action prediction with an action-aware memory bank to enrich the image and text features with action-similar text features. The effectiveness of the proposed AME method is verified by comprehensive experimental results on two benchmark datasets.
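
The abstract's core mechanism, querying an action-aware memory bank and fusing the retrieved action-similar text features back into an image or text embedding, can be illustrated with a minimal sketch. The class name ActionAwareMemory, the slot count, the top-k retrieval rule, and the concatenation-based fusion below are illustrative assumptions, not the authors' actual architecture.

  # A minimal, hypothetical sketch of the idea in the abstract: a memory
  # bank of text features is queried with an image/text embedding, and
  # the retrieved action-similar features are fused back in. All names,
  # sizes, and the fusion rule are assumptions for illustration only.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class ActionAwareMemory(nn.Module):
      def __init__(self, dim=512, num_slots=256, num_actions=100, top_k=8):
          super().__init__()
          # Memory bank of text features, one row per slot (assumed learnable).
          self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
          # Auxiliary classifier standing in for the "action prediction" head.
          self.action_head = nn.Linear(dim, num_actions)
          self.top_k = top_k
          # Fusion by concatenation + projection (an assumed design choice).
          self.fuse = nn.Linear(2 * dim, dim)

      def forward(self, feat):
          # feat: (batch, dim) image or text embedding.
          action_logits = self.action_head(feat)
          # Cosine similarity between the query and every memory slot.
          sim = F.normalize(feat, dim=-1) @ F.normalize(self.memory, dim=-1).t()
          topv, topi = sim.topk(self.top_k, dim=-1)   # most action-similar slots
          weights = topv.softmax(dim=-1)              # (batch, top_k)
          retrieved = (weights.unsqueeze(-1) * self.memory[topi]).sum(dim=1)
          enhanced = self.fuse(torch.cat([feat, retrieved], dim=-1))
          return F.normalize(enhanced, dim=-1), action_logits

  # Illustrative usage: enhance both modalities, then score retrieval
  # by cosine similarity in the shared embedding space.
  mem = ActionAwareMemory()
  img_e, _ = mem(torch.randn(4, 512))
  txt_e, _ = mem(torch.randn(4, 512))
  scores = img_e @ txt_e.t()

In this reading, the action_logits output would be trained with an auxiliary action-prediction loss so that the query embeddings become action-aware before probing the memory; how AME actually supervises the memory and the prediction head is specified in the paper itself.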

Cite

Text

Li et al. "Action-Aware Embedding Enhancement for Image-Text Retrieval." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I2.20020

Markdown

[Li et al. "Action-Aware Embedding Enhancement for Image-Text Retrieval." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/li2022aaai-action/) doi:10.1609/AAAI.V36I2.20020

BibTeX

@inproceedings{li2022aaai-action,
  title     = {{Action-Aware Embedding Enhancement for Image-Text Retrieval}},
  author    = {Li, Jiangtong and Niu, Li and Zhang, Liqing},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {1323--1331},
  doi       = {10.1609/AAAI.V36I2.20020},
  url       = {https://mlanthology.org/aaai/2022/li2022aaai-action/}
}