Unifying Vision-Language Representation Space with Single-Tower Transformer

Jang, Jiho; Kong, Chaerin; Jeon, Donghyeon; Kim, Seonhoon; Kwak, Nojun

doi:10.1609/AAAI.V37I1.25178

AAAI 2023 pp. 980-988

Abstract

Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and its caption can be regarded as two different views of the same underlying mutual information, and we train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify the difficulties in learning a one-tower model for vision-language pretraining (VLP), and propose One Representation (OneR) as a simple yet effective framework for this goal. We discover intriguing properties, such as zero-shot localization, text-guided visual reasoning and multi-modal retrieval, that distinguish OneR from previous works with modality-specific representation spaces, and we present analyses that provide insight into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework.
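
The abstract treats an image and its caption as two views of the same content, which is the setting a generic contrastive objective is built for. The NumPy sketch below illustrates that idea with a symmetric InfoNCE-style loss over a batch of paired embeddings. It is not OneR's actual training objective, which the abstract does not specify; the function names, the temperature value, and the assumption that both embedding matrices come from one shared encoder are all illustrative.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive loss (hypothetical helper, not from the paper).

    image_emb, text_emb: arrays of shape (batch, dim). Row i of each array is
    assumed to be the embedding of the i-th image and of its caption, both
    produced by the same (single-tower) encoder.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should assign the most mass to its own caption.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should assign the most mass to its own image.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Minimizing this quantity pulls matched image and caption embeddings together and pushes mismatched pairs apart within the one shared space, which is the sense in which the two modalities act as views of the same underlying information.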

Cite

Text

Jang et al. "Unifying Vision-Language Representation Space with Single-Tower Transformer." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I1.25178

Markdown

[Jang et al. "Unifying Vision-Language Representation Space with Single-Tower Transformer." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/jang2023aaai-unifying/) doi:10.1609/AAAI.V37I1.25178

BibTeX

@inproceedings{jang2023aaai-unifying,
  title     = {{Unifying Vision-Language Representation Space with Single-Tower Transformer}},
  author    = {Jang, Jiho and Kong, Chaerin and Jeon, Donghyeon and Kim, Seonhoon and Kwak, Nojun},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {980--988},
  doi       = {10.1609/AAAI.V37I1.25178},
  url       = {https://mlanthology.org/aaai/2023/jang2023aaai-unifying/}
}