Emphasizing Complementary Samples for Non-Literal Cross-Modal Retrieval
Abstract
Existing cross-modal retrieval methods assume a straightforward relationship in which images and text portray or mention the same objects. In contrast, real-world image-text pairs (e.g., an image and its caption in a news article) often feature more complex relations. Importantly, not all image-text pairs have the same relationship: in some pairs, image and text may be closely aligned, while in others they are more loosely aligned and hence complementary. To ensure the model learns a semantically robust space that captures these nuanced relationships, care must be taken that loosely-aligned image-text pairs have a strong enough impact on learning. In this paper, we propose a novel approach that prioritizes loosely-aligned samples. Unlike prior sample weighting methods, ours estimates to what extent semantic similarity is preserved in the separate channels (images/text) of the learned multimodal space. In particular, the image-text pair weights in the retrieval loss focus learning on samples from diverse or discrepant neighborhoods: samples whose images or text were close in a semantic space but are distant in the cross-modal space (diversity), or whose neighbor relations are asymmetric (discrepancy). Experiments on three challenging datasets exhibiting abstract image-text relations, as well as on COCO, demonstrate significant performance gains over recent state-of-the-art models and sample weighting approaches.
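As a rough illustration of the weighting idea in the abstract, the sketch below computes per-sample weights from k-nearest-neighbor overlap: a diversity weight that grows when a sample's semantic-space neighbors are lost in the learned cross-modal space, and a discrepancy weight that grows when a pair's image-side and text-side neighborhoods disagree. All function names, the cosine-similarity neighborhoods, and the simple overlap-based weighting are illustrative assumptions, not the paper's actual formulation.

import numpy as np

def knn_indices(emb, k):
    # Indices of the k nearest neighbors of each row under cosine
    # similarity, excluding the point itself.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]

def diversity_weights(semantic_emb, crossmodal_emb, k=10):
    # Upweight samples whose semantic-space neighbors are no longer
    # neighbors in the learned cross-modal space (illustrative sketch,
    # not the authors' implementation).
    sem_nn = knn_indices(semantic_emb, k)
    cm_nn = knn_indices(crossmodal_emb, k)
    overlap = np.array([len(set(s) & set(c)) / k
                        for s, c in zip(sem_nn, cm_nn)])
    return 1.0 - overlap  # low neighborhood overlap -> high weight

def discrepancy_weights(img_emb, txt_emb, k=10):
    # Upweight pairs whose image-side and text-side neighborhoods in the
    # cross-modal space disagree (asymmetric neighbor relations).
    img_nn = knn_indices(img_emb, k)
    txt_nn = knn_indices(txt_emb, k)
    agree = np.array([len(set(i) & set(t)) / k
                      for i, t in zip(img_nn, txt_nn)])
    return 1.0 - agree  # asymmetric neighborhoods -> high weight

Such weights could then scale the per-pair terms of a standard triplet or contrastive retrieval loss; the exact loss and weighting scheme used by the paper may differ.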
Cite
Text
Thomas and Kovashka. "Emphasizing Complementary Samples for Non-Literal Cross-Modal Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00509
Markdown
[Thomas and Kovashka. "Emphasizing Complementary Samples for Non-Literal Cross-Modal Retrieval." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/thomas2022cvprw-emphasizing/) doi:10.1109/CVPRW56347.2022.00509
BibTeX
@inproceedings{thomas2022cvprw-emphasizing,
title = {{Emphasizing Complementary Samples for Non-Literal Cross-Modal Retrieval}},
author = {Thomas, Christopher and Kovashka, Adriana},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2022},
pages = {4631--4640},
doi = {10.1109/CVPRW56347.2022.00509},
url = {https://mlanthology.org/cvprw/2022/thomas2022cvprw-emphasizing/}
}