Contrastive Learning for Weakly Supervised Phrase Grounding
Abstract
Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on the mutual information between images and caption words. Given pairs of images and captions, we maximize the compatibility of the attention-weighted regions with the words in the corresponding caption, relative to non-corresponding image-caption pairs. A key idea is to construct effective negative captions for learning through language-model-guided word substitutions. Training with our negatives yields a $\sim10\%$ absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$, achieving $76.7\%$ accuracy on the Flickr30K Entities benchmark. Our code and project material will be available at http://tanmaygupta.info/info-ground.
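The sketch below illustrates the kind of objective the abstract describes: each caption word attends over image regions, the word's compatibility score is taken against its attention-weighted region vector, and an InfoNCE-style softmax over one positive and several negative captions yields a lower bound on image-caption mutual information. This is a minimal illustration, not the authors' implementation; all names (`score_caption`, `info_nce_loss`) and feature shapes are our assumptions.

```python
# Minimal sketch (not the paper's code) of an InfoNCE-style contrastive
# objective with word-region attention.
import torch
import torch.nn.functional as F

def score_caption(region_feats, word_feats):
    """Compatibility of one image's regions with one caption's words.

    region_feats: (R, d) features for R image regions
    word_feats:   (T, d) contextualized features for T caption words
    """
    attn = F.softmax(word_feats @ region_feats.T, dim=-1)  # (T, R) word-over-region attention
    attended = attn @ region_feats                         # (T, d) attention-weighted regions
    return (word_feats * attended).sum()                   # scalar compatibility score

def info_nce_loss(region_feats, pos_words, neg_words_list):
    """The true caption (index 0) must outscore K negative captions.

    Maximizing the positive pair's log-softmax maximizes an InfoNCE
    lower bound on the mutual information between image and caption.
    """
    scores = torch.stack(
        [score_caption(region_feats, pos_words)]
        + [score_caption(region_feats, w) for w in neg_words_list]
    )  # (1 + K,)
    return -F.log_softmax(scores, dim=0)[0]

# Example with random features: one positive caption, two negatives.
R, T, d = 36, 12, 256
loss = info_nce_loss(torch.randn(R, d), torch.randn(T, d),
                     [torch.randn(T, d) for _ in range(2)])
```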
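The paper's key idea of language-model-guided word substitutions could be realized roughly as below: mask a word in the true caption and let a masked language model propose plausible-but-wrong substitutes, yielding hard negative captions. This is a hedged sketch using the standard Hugging Face `transformers` API; the helper name and the top-k selection heuristic are our assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of LM-guided negative captions via masked-word
# substitution with BERT; selection heuristic is an assumption.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def substitute_negatives(caption, target_word, k=5):
    """Return up to k captions with `target_word` replaced by MLM guesses."""
    masked = caption.replace(target_word, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                      # (1, seq, vocab)
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    top_ids = logits[0, mask_pos].topk(k + 1).indices
    subs = [tokenizer.decode([int(i)]).strip() for i in top_ids]
    subs = [s for s in subs if s != target_word][:k]         # drop the original word
    return [caption.replace(target_word, s, 1) for s in subs]

# Example: hard negatives for the word "dog" in a caption.
print(substitute_negatives("a dog running on the beach", "dog"))
```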
Cite
Text
Gupta et al. "Contrastive Learning for Weakly Supervised Phrase Grounding." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58580-8_44
BibTeX
@inproceedings{gupta2020eccv-contrastive,
  title     = {{Contrastive Learning for Weakly Supervised Phrase Grounding}},
  author    = {Gupta, Tanmay and Vahdat, Arash and Chechik, Gal and Yang, Xiaodong and Kautz, Jan and Hoiem, Derek},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020},
  doi       = {10.1007/978-3-030-58580-8_44},
  url       = {https://mlanthology.org/eccv/2020/gupta2020eccv-contrastive/}
}