Fast Self-Attentive Multimodal Retrieval

Abstract

Multimodal bidirectional retrieval is a challenging task that consists of semantically aligning two distinct modalities, such as images and textual descriptions, so that content from one modality can be retrieved given the other. The goal is to learn a common semantic space for both modalities in which their correspondences can be discovered. This paper presents SEAM, a fast and effective attention-based architecture for learning representations for multimodal retrieval. It is built around a self-attention module designed to enhance relevant textual information in word embeddings while suppressing irrelevant data. We design three incarnations of SEAM so we can properly assess the performance of the self-attention module when operating over distinct representations: (i) the word embeddings themselves; (ii) features learned by convolutional layers at distinct granularities; and (iii) features learned by gated recurrent units (GRUs). The output of the self-attention module is projected onto a shared multimodal space, where the semantic correspondence between images and descriptions is learned via a contrastive pairwise loss function that minimizes order-violations. We analyze several architectural choices for our approach, and we compare our best models with current state-of-the-art approaches on Microsoft COCO, the largest and best-known multimodal retrieval dataset. Results show that SEAM outperforms the current state of the art in most cases while being a considerably faster approach for multimodal retrieval.
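The contrastive pairwise objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the order-violation penalty of order-embeddings (a caption embedding is penalized for exceeding its image embedding coordinate-wise) and a standard hinge-style contrastive loss; the margin value and function names are illustrative.

```python
def order_violation(img, cap):
    """Order-violation penalty: how much the caption embedding `cap`
    fails to be coordinate-wise dominated by the image embedding `img`.
    Zero means the pair satisfies the partial order exactly."""
    return sum(max(0.0, c - i) ** 2 for i, c in zip(img, cap))


def pairwise_contrastive_loss(img_pos, cap_pos, img_neg, cap_neg, margin=0.05):
    """Hinge-style contrastive loss: the matching (positive) pair must
    have a lower order-violation than non-matching (negative) pairs by
    at least `margin`. Both directions (image->caption and
    caption->image) contribute a term."""
    pos = order_violation(img_pos, cap_pos)
    # Positive image against a non-matching caption.
    loss_c = max(0.0, margin + pos - order_violation(img_pos, cap_neg))
    # Positive caption against a non-matching image.
    loss_i = max(0.0, margin + pos - order_violation(img_neg, cap_pos))
    return loss_c + loss_i
```

In a real training loop the embeddings would come from the self-attention module (for text) and an image encoder, and the negatives would be sampled from the same mini-batch; the loss is zero only when every positive pair beats its negatives by the margin.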

Cite

Text

Wehrmann et al. "Fast Self-Attentive Multimodal Retrieval." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018. doi:10.1109/WACV.2018.00207

Markdown

[Wehrmann et al. "Fast Self-Attentive Multimodal Retrieval." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018.](https://mlanthology.org/wacv/2018/wehrmann2018wacv-fast/) doi:10.1109/WACV.2018.00207

BibTeX

@inproceedings{wehrmann2018wacv-fast,
  title     = {{Fast Self-Attentive Multimodal Retrieval}},
  author    = {Wehrmann, Jonatas and Lopes, Mauricio A. and Móre, Martin D. and Barros, Rodrigo C.},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
  year      = {2018},
  pages     = {1871--1878},
  doi       = {10.1109/WACV.2018.00207},
  url       = {https://mlanthology.org/wacv/2018/wehrmann2018wacv-fast/}
}