Linking Image and Text with 2-Way Nets
Abstract
Linking two data sources is a basic building block in numerous computer vision problems. Canonical Correlation Analysis (CCA) achieves this by using a linear optimizer to maximize the correlation between the two views. Recent work makes use of non-linear models, including deep learning techniques, that optimize the CCA loss in some feature space. In this paper, we introduce a novel, bi-directional neural network architecture for the task of matching vectors from two data sources. Our approach employs two tied neural network channels that project the two views into a common, maximally correlated space using the Euclidean loss. We show a direct link between the correlation-based loss and Euclidean loss, enabling the use of Euclidean loss for correlation maximization. To overcome common Euclidean regression optimization problems, we adapt well-known techniques to our problem, including batch normalization and dropout. We show state-of-the-art results on a number of computer vision matching tasks, including MNIST image matching and sentence-image matching on the Flickr8k, Flickr30k and COCO datasets.
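The link between correlation and Euclidean loss mentioned in the abstract can be illustrated with a standard identity: for two variables standardized to zero mean and unit variance, the mean squared difference equals 2 - 2ρ, where ρ is their Pearson correlation, so minimizing one maximizes the other. The sketch below checks this numerically on synthetic data. It is not the paper's derivation or architecture; the helper names (`standardize`, `pearson`, `mean_squared_distance`) and the synthetic data are assumptions made for illustration only.

```python
import math
import random


def standardize(values):
    """Shift to zero mean and scale to unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    xs, ys = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(xs, ys)) / len(xs)


def mean_squared_distance(x, y):
    """Mean squared difference after standardizing both inputs."""
    xs, ys = standardize(x), standardize(y)
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)


# Synthetic, hypothetical data: y is a noisy linear function of x.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
y = [0.8 * a + 0.6 * random.gauss(0.0, 1.0) for a in x]

rho = pearson(x, y)
msd = mean_squared_distance(x, y)

# For standardized variables: msd == 2 - 2 * rho (up to rounding error).
gap = abs(msd - (2.0 - 2.0 * rho))
print(f"rho={rho:.4f}  msd={msd:.4f}  |msd - (2 - 2*rho)|={gap:.2e}")
```

The identity depends on both variables having unit variance, which is one reason a normalization step matters when a Euclidean loss is used as a proxy for correlation.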
Cite
Text
Eisenschtat and Wolf. "Linking Image and Text with 2-Way Nets." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.201
Markdown
[Eisenschtat and Wolf. "Linking Image and Text with 2-Way Nets." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/eisenschtat2017cvpr-linking/) doi:10.1109/CVPR.2017.201
BibTeX
@inproceedings{eisenschtat2017cvpr-linking,
title = {{Linking Image and Text with 2-Way Nets}},
author = {Eisenschtat, Aviv and Wolf, Lior},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2017},
doi = {10.1109/CVPR.2017.201},
url = {https://mlanthology.org/cvpr/2017/eisenschtat2017cvpr-linking/}
}