Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Wu, Hao; Mao, Jiayuan; Zhang, Yufeng; Jiang, Yuning; Li, Lei; Sun, Weiwei; Ma, Wei-Ying

doi:10.1109/CVPR.2019.00677

CVPR 2019

Abstract

We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. We propose a contrastive learning approach that learns this fine-grained alignment from image-caption pairs alone. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks, and that enforcing semantic coverage improves the model's robustness against text-domain adversarial attacks. Moreover, our model enables the use of visual cues to accurately resolve word dependencies in novel sentences.
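
The abstract does not spell out the training objective, so the following is only a rough sketch of the general idea behind contrastive learning in a joint image-text space, not the paper's exact loss. It assumes a hinge-based bidirectional ranking loss over cosine similarities, with the other items in a batch acting as negatives. The names `cosine`, `ranking_loss`, and the `margin` value are hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def ranking_loss(img_embs, txt_embs, margin=0.2):
    """Hinge-based bidirectional ranking loss (illustrative assumption).

    img_embs[i] and txt_embs[i] are assumed to be a matching
    image-caption pair; every other pairing in the batch is a negative.
    """
    n = len(img_embs)
    # sim[i][j] = similarity between image i and caption j
    sim = [[cosine(img_embs[i], txt_embs[j]) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i should score its own caption above wrong caption j.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Caption j should score its own image above wrong image i.
            total += max(0.0, margin + sim[i][j] - sim[j][j])
    return total / n


# Toy batch: three one-hot "embeddings" standing in for encoder outputs.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_captions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_captions = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

loss_aligned = ranking_loss(images, aligned_captions)
loss_shuffled = ranking_loss(images, shuffled_captions)
print(loss_aligned, loss_shuffled)
```

In this toy batch the loss is zero when each caption matches its image and positive when captions are shuffled, which is the signal that pulls matching pairs together. The same kind of loss could in principle be applied to component-level embeddings (objects, attributes, relations) as well as full captions, though how the paper combines these levels is not stated in the abstract.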

Cite

Text

Wu et al. "Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.00677

Markdown

[Wu et al. "Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/wu2019cvpr-unified/) doi:10.1109/CVPR.2019.00677

BibTeX

@inproceedings{wu2019cvpr-unified,
  title     = {{Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations}},
  author    = {Wu, Hao and Mao, Jiayuan and Zhang, Yufeng and Jiang, Yuning and Li, Lei and Sun, Weiwei and Ma, Wei-Ying},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2019},
  doi       = {10.1109/CVPR.2019.00677},
  url       = {https://mlanthology.org/cvpr/2019/wu2019cvpr-unified/}
}