Deep Cross-Modal Projection Learning for Image-Text Matching

Abstract

The key challenge in image-text matching is accurately measuring the similarity between visual and textual inputs. Despite the great progress made by associating deep cross-modal embeddings with a bi-directional ranking loss, designing strategies for mining useful triplets and selecting appropriate margins remains difficult in real applications. In this paper, we propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings. The CMPM loss minimizes the KL divergence between the projection compatibility distributions and the normalized matching distributions defined over all positive and negative samples in a mini-batch. The CMPC loss categorizes the vector projection of representations from one modality onto the other with an improved norm-softmax loss, further enhancing the feature compactness of each class. Extensive analysis and experiments on multiple datasets demonstrate the superiority of the proposed approach.
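As a rough illustration of the CMPM objective summarized above, the sketch below computes projection-based matching probabilities within a mini-batch and the KL divergence against the normalized ground-truth matching distribution. It is a minimal PyTorch sketch under stated assumptions (batched image/text embeddings plus identity labels); the function name `cmpm_loss` and the variable names are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(image_emb, text_emb, labels, eps=1e-8):
    """Cross-modal projection matching loss (sketch).

    image_emb, text_emb: (batch, dim) embeddings from the two branches.
    labels: (batch,) identity ids; samples sharing an id are treated as matched.
    """
    # Normalized ground-truth matching distribution q over the mini-batch.
    match = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()
    q = match / match.sum(dim=1, keepdim=True)

    # Scalar projection of each image embedding onto each normalized text embedding,
    # turned into a compatibility distribution with softmax (image-to-text direction).
    text_norm = F.normalize(text_emb, p=2, dim=1)
    p_i2t = F.softmax(image_emb @ text_norm.t(), dim=1)
    loss_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(dim=1).mean()

    # Symmetric text-to-image direction.
    image_norm = F.normalize(image_emb, p=2, dim=1)
    p_t2i = F.softmax(text_emb @ image_norm.t(), dim=1)
    loss_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(dim=1).mean()

    return loss_i2t + loss_t2i
```

In the full model this matching term is combined with the CMPC classification loss, which applies a norm-softmax classifier to the vector projections of one modality's features onto the other's within the same mini-batch.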

Cite

Text

Zhang and Lu. "Deep Cross-Modal Projection Learning for Image-Text Matching." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01246-5_42

Markdown

[Zhang and Lu. "Deep Cross-Modal Projection Learning for Image-Text Matching." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/zhang2018eccv-deep/) doi:10.1007/978-3-030-01246-5_42

BibTeX

@inproceedings{zhang2018eccv-deep,
  title     = {{Deep Cross-Modal Projection Learning for Image-Text Matching}},
  author    = {Zhang, Ying and Lu, Huchuan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2018},
  doi       = {10.1007/978-3-030-01246-5_42},
  url       = {https://mlanthology.org/eccv/2018/zhang2018eccv-deep/}
}