Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

Abstract

Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, in which the model develops an implicit preference for certain modalities and under-optimizes the others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a **F**requency **R**atio **M**etric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we propose a **M**ultimodal **W**eight **A**llocation **M**odule (MWAM), a plug-and-play component that dynamically rebalances the contribution of each branch during training, thereby promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. Beyond improving the base model, MWAM also yields further gains when added to state-of-the-art methods designed to address the missing modality problem.
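
The abstract does not give the formula for FRM or the allocation rule used by MWAM, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a frequency ratio defined as the share of spectral energy above a radial cutoff, and it assumes that the branch with the larger ratio is the dominant one and should receive the smaller training weight. The names `frequency_ratio` and `allocate_weights`, the `cutoff` and `temperature` parameters, and the softmax rule are all hypothetical.

```python
import numpy as np


def frequency_ratio(features: np.ndarray, cutoff: float = 0.25) -> float:
    """Hypothetical stand-in for FRM (not the paper's definition).

    Returns the fraction of spectral energy whose normalized radial
    frequency exceeds `cutoff`. `features` has shape (C, H, W).
    """
    # 2D FFT over the spatial axes, with the zero frequency moved to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(features, axes=(-2, -1)), axes=(-2, -1))
    energy = np.abs(spectrum) ** 2

    # Radial frequency of every bin, in cycles per sample.
    h, w = features.shape[-2:]
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    high = energy[..., radius > cutoff].sum()
    return float(high / (energy.sum() + 1e-12))


def allocate_weights(ratios, temperature: float = 1.0) -> np.ndarray:
    """Assumed allocation rule: softmax over negated ratios.

    The branch with the larger ratio (assumed dominant) gets the smaller
    weight. The weights sum to 1.
    """
    logits = -np.asarray(ratios, dtype=float) / temperature
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Toy features for two modality branches (synthetic, for illustration only).
rng = np.random.default_rng(0)
rgb_feat = rng.normal(size=(8, 16, 16))  # noise-like: energy spread over all frequencies
depth_feat = np.ones((8, 16, 16))        # constant: all energy at zero frequency

ratios = [frequency_ratio(rgb_feat), frequency_ratio(depth_feat)]
weights = allocate_weights(ratios)
```

In a training loop, weights of this kind could scale each branch's loss term or gradient at every step, which is one way to read "dynamically rebalances the contribution of each branch"; the actual mechanism is specified in the paper, not here.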

Cite

Text

Lu et al. "Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models." International Conference on Learning Representations, 2026.

Markdown

[Lu et al. "Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lu2026iclr-plug/)

BibTeX

@inproceedings{lu2026iclr-plug,
  title     = {{Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models}},
  author    = {Lu, Siqi and Xu, Wanying and Zheng, Yongbin and Luan, Wenting and Sun, Peng and Yao, Jianhang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lu2026iclr-plug/}
}