Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
Abstract
Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image/audio/video captioning and text-to-image generation.
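To make the three steps more concrete, below is a minimal NumPy sketch of one way they could be instantiated on pre-computed contrastive embeddings. It is an illustrative reading of the abstract, not the paper's exact recipe: the variable names (`text_emb`, `image_emb`), the choice to collapse the gap by removing each modality's mean embedding, and the Gaussian noise scale `sigma` are all assumptions.

```python
import numpy as np

# Hypothetical pre-computed embeddings from a frozen contrastive encoder
# (CLIP-like): L2-normalized, shape (N, D). Random data stands in for real features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(1000, 512))
image_emb = rng.normal(size=(1000, 512))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

# Step 1 -- Connect: both modalities already live in the shared contrastive
# space produced by the pre-trained encoders; nothing further is learned here.

# Step 2 -- Collapse: align the two modality centroids by removing each
# modality's mean embedding (an illustrative choice for closing the gap).
text_collapsed = text_emb - text_emb.mean(axis=0)
image_collapsed = image_emb - image_emb.mean(axis=0)

# Step 3 -- Corrupt: when training on uni-modal data only (e.g., a captioner
# trained on text embeddings), add Gaussian noise so the decoder becomes
# robust to the residual mismatch it will see from the other modality at test time.
sigma = 0.05  # noise scale is a hyperparameter, not taken from the paper

def corrupt(emb, sigma=sigma, rng=rng):
    noisy = emb + sigma * rng.normal(size=emb.shape)
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

train_inputs = corrupt(text_collapsed)  # train the decoder on corrupted text embeddings
test_inputs = image_collapsed           # at inference, feed collapsed image embeddings instead
```

Under this reading, the decoder never sees paired data: it is trained purely on (corrupted) text embeddings and, because the modality gap has been collapsed and the noise covers what remains, image embeddings can be substituted at test time.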
Cite
Text
Zhang et al. "Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data." International Conference on Learning Representations, 2024.
Markdown
[Zhang et al. "Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/zhang2024iclr-connect/)
BibTeX
@inproceedings{zhang2024iclr-connect,
title = {{Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data}},
author = {Zhang, Yuhui and Sui, Elaine and Yeung, Serena},
booktitle = {International Conference on Learning Representations},
year = {2024},
url = {https://mlanthology.org/iclr/2024/zhang2024iclr-connect/}
}