$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-Aware Mixture-of-Experts

Abstract

Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, existing approaches are limited by $\textbf{(a)}$ their focus on modality correspondences, which neglects heterogeneous interactions between modalities, and $\textbf{(b)}$ the fact that they output a single multimodal prediction without offering interpretable insights into the multimodal interactions present in the data. In this work, we propose $\texttt{I$^2$MoE}$ ($\underline{I}$nterpretable Multimodal $\underline{I}$nteraction-aware $\underline{M}$ixture-$\underline{o}$f-$\underline{E}$xperts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation at both the local and global level. First, $\texttt{I$^2$MoE}$ utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, $\texttt{I$^2$MoE}$ deploys a reweighting model that assigns importance scores to the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation on medical and general multimodal datasets shows that $\texttt{I$^2$MoE}$ is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
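To make the abstract's description more concrete, below is a minimal PyTorch sketch of the two components it names: a set of interaction experts whose outputs are combined by a reweighting model that produces per-sample importance scores. This is an illustration based only on the abstract; the class and parameter names (`I2MoESketch`, `fused_dim`, `num_experts`) are hypothetical, and the paper's weakly supervised interaction losses and specific expert designs are not reproduced here. See the official repository at https://github.com/Raina-Xin/I2MoE for the actual implementation.

```python
import torch
import torch.nn as nn


class I2MoESketch(nn.Module):
    """Hedged sketch of the I^2MoE idea from the abstract: several interaction
    experts each produce a prediction, and a reweighting model assigns
    per-sample importance scores used to combine them."""

    def __init__(self, fused_dim: int, num_experts: int, num_classes: int):
        super().__init__()
        # Hypothetical experts; in the paper each expert is trained with a
        # weakly supervised interaction loss to capture a distinct
        # multimodal interaction type (not modeled in this sketch).
        self.experts = nn.ModuleList(
            nn.Linear(fused_dim, num_classes) for _ in range(num_experts)
        )
        # Reweighting model: maps the fused representation to a distribution
        # of importance scores over the interaction experts.
        self.reweight = nn.Sequential(
            nn.Linear(fused_dim, num_experts),
            nn.Softmax(dim=-1),
        )

    def forward(self, fused: torch.Tensor):
        # fused: (batch, fused_dim) representation from any fusion backbone.
        expert_out = torch.stack([e(fused) for e in self.experts], dim=1)  # (B, E, C)
        weights = self.reweight(fused)                                     # (B, E)
        # The weights serve as sample-level interpretation; averaging them
        # over a dataset gives a dataset-level view of interaction importance.
        pred = (weights.unsqueeze(-1) * expert_out).sum(dim=1)             # (B, C)
        return pred, weights


# Usage example with toy dimensions (all values are placeholders).
model = I2MoESketch(fused_dim=128, num_experts=4, num_classes=3)
pred, weights = model(torch.randn(8, 128))
print(pred.shape, weights.shape)  # torch.Size([8, 3]) torch.Size([8, 4])
```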

Cite

Text

Xin et al. "$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-Aware Mixture-of-Experts." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Xin et al. "$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-Aware Mixture-of-Experts." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/xin2025icml-2moe/)

BibTeX

@inproceedings{xin2025icml-2moe,
  title     = {{$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-Aware Mixture-of-Experts}},
  author    = {Xin, Jiayi and Yun, Sukwon and Peng, Jie and Choi, Inyoung and Ballard, Jenna L. and Chen, Tianlong and Long, Qi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {68870--68888},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/xin2025icml-2moe/}
}