Extending Multi-Modal Contrastive Representations

Abstract

Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, their heavy dependence on large-scale, high-quality paired data and their expensive training costs limit further development. Inspired by the recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method for building a unified contrastive representation for many modalities. Because C-MCR learns a new latent space for two non-overlapping modalities and projects both onto it, a significant amount of information from their original spaces is lost in the projection. To address this issue, Ex-MCR extends one modality's space into the other's, rather than mapping both modalities onto a completely new space. This effectively preserves the semantic alignment of the original space. Experimentally, we extend pre-trained audio-text and 3D-image representations to the existing vision-text space. Without using paired data, Ex-MCR achieves performance comparable to advanced methods on a series of audio-image-text and 3D-image-text tasks, and achieves superior performance when used in parallel with data-driven methods. Moreover, semantic alignment also emerges between the extended modalities (e.g., audio and 3D).
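The idea of extending one space into another, rather than projecting both into a new one, can be illustrated with a minimal toy sketch. This is not the paper's training procedure: it assumes the audio-text ("leaf") and vision-text ("base") spaces share a text modality, uses synthetic embeddings, and swaps in a closed-form least-squares linear projector for whatever learned projector Ex-MCR actually uses. All names (`text_leaf`, `text_base`, `hidden_map`, `extend_to_base`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Hypothetical toy data: the same 200 texts embedded by two frozen encoders.
n_texts, d_leaf, d_base = 200, 16, 32
text_leaf = l2_normalize(rng.normal(size=(n_texts, d_leaf)))  # audio-text ("leaf") space
hidden_map = rng.normal(size=(d_leaf, d_base))  # toy relation between the two spaces
text_base = text_leaf @ hidden_map  # vision-text ("base") space, never modified

# Fit a projector from the leaf space into the base space using only the
# shared text modality. Least squares is an assumption made for brevity.
W, *_ = np.linalg.lstsq(text_leaf, text_base, rcond=None)


def extend_to_base(leaf_embeddings):
    """Map leaf-space embeddings into the frozen base space."""
    return l2_normalize(leaf_embeddings @ W)


# Toy audio embeddings that sit close to their captions in the leaf space.
audio_leaf = l2_normalize(text_leaf + 0.01 * rng.normal(size=text_leaf.shape))
audio_in_base = extend_to_base(audio_leaf)

# Audio-to-text retrieval now happens directly in the base space.
similarity = audio_in_base @ l2_normalize(text_base).T
top1_accuracy = float((similarity.argmax(axis=1) == np.arange(n_texts)).mean())
print(f"top-1 audio-to-text retrieval accuracy: {top1_accuracy:.2f}")
```

Only the leaf side is transformed in this sketch; the base embeddings are used as-is, so any alignment already present in the base space (for example, between images and text) is left untouched.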

Cite

Text

Zhang et al. "Extending Multi-Modal Contrastive Representations." Neural Information Processing Systems, 2024. doi:10.52202/079017-2915

Markdown

[Zhang et al. "Extending Multi-Modal Contrastive Representations." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhang2024neurips-extending/) doi:10.52202/079017-2915

BibTeX

@inproceedings{zhang2024neurips-extending,
  title     = {{Extending Multi-Modal Contrastive Representations}},
  author    = {Zhang, Ziang and Wang, Zehan and Liu, Luping and Huang, Rongjie and Cheng, Xize and Ye, Zhenhui and Lin, Wang and Liu, Huadai and Huang, Haifeng and Zhao, Yang and Jin, Tao and Zheng, Siqi and Zhao, Zhou},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2915},
  url       = {https://mlanthology.org/neurips/2024/zhang2024neurips-extending/}
}