Learning Unseen Modality Interaction

Abstract

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality’s prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.
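To make the core idea concrete, below is a minimal sketch of the projection-and-summation mechanism described in the abstract: each modality's features are mapped into a shared space and the available modalities are fused by simple summation. This is not the authors' released implementation; the module names, dimensions, and the mean-pooling step are illustrative assumptions.

```python
# Hypothetical sketch of per-modality projection into a common space,
# fused by summation over whichever modalities are present at inference.
import torch
import torch.nn as nn


class SharedProjectionFusion(nn.Module):
    def __init__(self, input_dims: dict, shared_dim: int, num_classes: int):
        super().__init__()
        # One lightweight projector per modality (illustrative design choice).
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in input_dims.items()}
        )
        self.classifier = nn.Linear(shared_dim, num_classes)

    def forward(self, features: dict) -> torch.Tensor:
        # features: {modality_name: tensor of shape (batch, tokens, dim)},
        # containing only the modalities available for this sample.
        pooled = []
        for name, feats in features.items():
            tokens = self.projectors[name](feats)   # (batch, tokens, shared_dim)
            pooled.append(tokens.mean(dim=1))       # pool tokens per modality
        fused = torch.stack(pooled, dim=0).sum(dim=0)  # simple summation
        return self.classifier(fused)


# Usage: train with one combination (e.g. video + audio) and evaluate with an
# unseen one (e.g. audio + flow); only the per-modality projectors must exist.
model = SharedProjectionFusion(
    {"video": 768, "audio": 512, "flow": 1024}, shared_dim=256, num_classes=10
)
logits = model({"audio": torch.randn(2, 16, 512),
                "flow": torch.randn(2, 16, 1024)})
```

Because every modality lands in the same space, the summed representation has a fixed size regardless of which or how many modalities are observed, which is what lets the classifier generalize to modality combinations never seen during training. The paper's reliability pseudo-supervision (down-weighting less discriminative modalities) is not shown in this sketch.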

Cite

Text

Zhang et al. "Learning Unseen Modality Interaction." Neural Information Processing Systems, 2023.

Markdown

[Zhang et al. "Learning Unseen Modality Interaction." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/zhang2023neurips-learning/)

BibTeX

@inproceedings{zhang2023neurips-learning,
  title     = {{Learning Unseen Modality Interaction}},
  author    = {Zhang, Yunhua and Doughty, Hazel and Snoek, Cees},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/zhang2023neurips-learning/}
}