Leveraging Visual Question Answering for Image-Caption Ranking
Abstract
Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a “feature extraction” module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
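To make the idea concrete, below is a minimal sketch (not the authors' code) of score-level fusion as described in the abstract: a set of question-answer (QA) pairs serves as feature dimensions, a VQA model estimates how plausible each QA pair is given the image, a "VQA on captions" model does the same given the caption, and the agreement between the two is blended with a VQA-agnostic ranking score. The helpers `qa_plausibility_from_image`, `qa_plausibility_from_caption`, and `vqa_agnostic_score` are hypothetical placeholders standing in for trained models.

```python
# Hedged sketch of score-level fusion for image-caption ranking.
# Assumptions: N_QA question-answer pairs define the feature space; the three
# helper functions below are stubs for real trained models.

import numpy as np

N_QA = 3000  # number of (question, answer) pairs used as feature dimensions


def qa_plausibility_from_image(image) -> np.ndarray:
    """Hypothetical VQA model: plausibility of each QA pair given the image."""
    rng = np.random.default_rng(abs(hash(image)) % (2**32))  # stub: random features
    return rng.random(N_QA)


def qa_plausibility_from_caption(caption) -> np.ndarray:
    """Hypothetical 'VQA on captions' model: plausibility of each QA pair given the caption."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))  # stub: random features
    return rng.random(N_QA)


def vqa_agnostic_score(image, caption) -> float:
    """Hypothetical baseline ranking score (e.g., image-text embedding similarity)."""
    return 0.0  # stub


def vqa_score(image, caption) -> float:
    """Agreement between what the image and the caption 'imagine' about the QA pairs."""
    x_img = qa_plausibility_from_image(image)
    x_cap = qa_plausibility_from_caption(caption)
    # Higher when image and caption assign similar plausibility to the same facts.
    return float(x_img @ x_cap) / N_QA


def fused_score(image, caption, alpha: float = 0.7) -> float:
    """Score-level fusion: weighted sum of VQA-agnostic and VQA-based scores."""
    return alpha * vqa_agnostic_score(image, caption) + (1.0 - alpha) * vqa_score(image, caption)


if __name__ == "__main__":
    captions = ["a dog catching a frisbee", "a plate of pasta on a table"]
    # Rank candidate captions for a query image by the fused score.
    ranked = sorted(captions, key=lambda c: fused_score("query_image.jpg", c), reverse=True)
    print(ranked)
```

The weight `alpha` and the number of QA pairs are illustrative choices; the paper's representation-level fusion variant would instead feed the QA-plausibility vectors into the ranking model itself rather than combining scores.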
Cite
Text
Lin and Parikh. "Leveraging Visual Question Answering for Image-Caption Ranking." European Conference on Computer Vision, 2016. doi:10.1007/978-3-319-46475-6_17
Markdown
[Lin and Parikh. "Leveraging Visual Question Answering for Image-Caption Ranking." European Conference on Computer Vision, 2016.](https://mlanthology.org/eccv/2016/lin2016eccv-leveraging/) doi:10.1007/978-3-319-46475-6_17
BibTeX
@inproceedings{lin2016eccv-leveraging,
title = {{Leveraging Visual Question Answering for Image-Caption Ranking}},
author = {Lin, Xiao and Parikh, Devi},
booktitle = {European Conference on Computer Vision},
year = {2016},
pages = {261-277},
doi = {10.1007/978-3-319-46475-6_17},
url = {https://mlanthology.org/eccv/2016/lin2016eccv-leveraging/}
}