Disentangled Cross-Modal Representation Learning with Enhanced Mutual Supervision
Abstract
Cross-modal representation learning aims to extract semantically aligned representations from heterogeneous modalities such as images and text. Existing multimodal VAE-based models often suffer from limited capability to align heterogeneous modalities or lack sufficient structural constraints to clearly separate the modality-specific and shared factors. In this work, we propose a novel framework, termed **D**isentangled **C**ross-**M**odal Representation Learning with **E**nhanced **M**utual Supervision (DCMEM). Specifically, our model disentangles the common and distinct information across modalities and regularizes the shared representation learned from each modality in a mutually supervised manner. Moreover, we incorporate the information bottleneck principle into our model to ensure that the shared and modality-specific factors encode exclusive yet complementary information. Notably, our model is designed to be trainable on both complete and partial multimodal datasets with a valid Evidence Lower Bound. Extensive experimental results demonstrate significant improvements of our model over existing methods on various tasks including cross-modal generation, clustering, and classification.
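To make the idea of splitting each modality's latent code into shared and modality-specific parts more concrete, the following is a minimal sketch, not the authors' implementation. It assumes diagonal Gaussian posteriors and uses a symmetric KL divergence between the two shared posteriors as a simple stand-in for the mutual supervision term; the abstract does not specify the actual regularizer. All function names, dimensions, and values (`split_latent`, `gaussian_kl`, `mutual_alignment_penalty`, the example encoder outputs) are hypothetical.

```python
import math


def split_latent(params, shared_dim):
    """Split a list of (mean, log_var) pairs into shared and modality-specific parts.

    Assumption: the first `shared_dim` dimensions hold the shared factors.
    """
    return params[:shared_dim], params[shared_dim:]


def gaussian_kl(p, q):
    """KL(p || q) for diagonal Gaussians given as lists of (mean, log_var)."""
    total = 0.0
    for (mu_p, lv_p), (mu_q, lv_q) in zip(p, q):
        total += 0.5 * (
            lv_q - lv_p
            + (math.exp(lv_p) + (mu_p - mu_q) ** 2) / math.exp(lv_q)
            - 1.0
        )
    return total


def mutual_alignment_penalty(shared_a, shared_b):
    """Symmetric KL between two shared posteriors (illustrative stand-in only)."""
    return gaussian_kl(shared_a, shared_b) + gaussian_kl(shared_b, shared_a)


# Hypothetical encoder outputs for one image/text pair: (mean, log_var) per dimension.
image_params = [(0.0, 0.0), (1.0, 0.0), (0.5, -1.0)]
text_params = [(0.0, 0.0), (1.0, 0.0), (-2.0, 0.3)]

img_shared, img_private = split_latent(image_params, shared_dim=2)
txt_shared, txt_private = split_latent(text_params, shared_dim=2)

# The shared parts agree here, so the penalty vanishes even though
# the modality-specific parts differ.
penalty = mutual_alignment_penalty(img_shared, txt_shared)
```

In this sketch only the shared dimensions are pulled toward each other, while the modality-specific dimensions are left unconstrained by the penalty. A full model along the lines the abstract describes would additionally need reconstruction terms and an information bottleneck constraint, which are omitted here.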
Cite
Text
Gao et al. "Disentangled Cross-Modal Representation Learning with Enhanced Mutual Supervision." Advances in Neural Information Processing Systems, 2025.
Markdown
[Gao et al. "Disentangled Cross-Modal Representation Learning with Enhanced Mutual Supervision." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/gao2025neurips-disentangled/)
BibTeX
@inproceedings{gao2025neurips-disentangled,
title = {{Disentangled Cross-Modal Representation Learning with Enhanced Mutual Supervision}},
author = {Gao, Lu and Chen, Wenlan and Wang, Daoyuan and Guo, Fei and Liang, Cheng},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/gao2025neurips-disentangled/}
}