Learning a Recurrent Residual Fusion Network for Multimodal Matching

Abstract

A major challenge in matching between vision and language is that they typically have completely different features and representations. In this work, we introduce a novel bridge between the modality-specific representations by creating a co-embedding space based on a recurrent residual fusion (RRF) block. Specifically, RRF adapts the recurrent mechanism to residual learning, so that it can recursively improve feature embeddings while retaining shared parameters. A fusion module then integrates the intermediate recurrent outputs to generate a more powerful representation. In the matching network, RRF acts as a feature enhancement component that maps visual and textual representations into a more discriminative embedding space, narrowing the cross-modal gap between vision and language. Moreover, we employ a bi-rank loss function to enforce separability of the two modalities in the embedding space. In the experiments, we evaluate the proposed RRF-Net on two multi-modal datasets, where it achieves state-of-the-art results.
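The sketch below is a minimal illustration of the two ideas named in the abstract: a residual transform with shared weights applied recursively, whose intermediate outputs are fused into a single embedding, and a bidirectional (image-to-text and text-to-image) ranking loss. All module and parameter names (RRFBlock, bi_rank_loss, num_steps, margin) and the specific layer choices are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical PyTorch sketch of a recurrent residual fusion block and a
# bidirectional ranking loss, loosely following the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RRFBlock(nn.Module):
    """Recurrent residual fusion: one residual transform with shared weights
    is applied for several recurrent steps, and the intermediate outputs are
    fused into a single embedding."""

    def __init__(self, dim, num_steps=3):
        super().__init__()
        self.num_steps = num_steps
        # Shared parameters reused at every recurrent step.
        self.residual = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )
        # Fusion of the intermediate recurrent outputs (here: concat + linear).
        self.fuse = nn.Linear(dim * num_steps, dim)

    def forward(self, x):
        outputs = []
        h = x
        for _ in range(self.num_steps):
            h = h + self.residual(h)      # residual update with shared weights
            outputs.append(h)
        fused = self.fuse(torch.cat(outputs, dim=1))
        return F.normalize(fused, dim=1)  # L2-normalized joint embedding


def bi_rank_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional triplet ranking loss over in-batch negatives; matched
    image-text pairs lie on the diagonal of the similarity matrix."""
    scores = img_emb @ txt_emb.t()                        # cosine similarities
    pos = scores.diag().view(-1, 1)
    cost_i2t = (margin + scores - pos).clamp(min=0)       # image as anchor
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)   # text as anchor
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return (cost_i2t.masked_fill(mask, 0).sum()
            + cost_t2i.masked_fill(mask, 0).sum())
```

In this sketch, one RRFBlock per modality would sit on top of the modality-specific encoders (e.g. a CNN for images and a text encoder), and the two L2-normalized outputs would be trained jointly with the bidirectional ranking loss.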

Cite

Text

Liu et al. "Learning a Recurrent Residual Fusion Network for Multimodal Matching." International Conference on Computer Vision, 2017. doi:10.1109/ICCV.2017.442

Markdown

[Liu et al. "Learning a Recurrent Residual Fusion Network for Multimodal Matching." International Conference on Computer Vision, 2017.](https://mlanthology.org/iccv/2017/liu2017iccv-learning-a/) doi:10.1109/ICCV.2017.442

BibTeX

@inproceedings{liu2017iccv-learning-a,
  title     = {{Learning a Recurrent Residual Fusion Network for Multimodal Matching}},
  author    = {Liu, Yu and Guo, Yanming and Bakker, Erwin M. and Lew, Michael S.},
  booktitle = {International Conference on Computer Vision},
  year      = {2017},
  doi       = {10.1109/ICCV.2017.442},
  url       = {https://mlanthology.org/iccv/2017/liu2017iccv-learning-a/}
}