CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation

Abstract

Infrared-visible image fusion and semantic segmentation are pivotal tasks for robust scene understanding under challenging conditions such as low light. However, existing methods often struggle with high noise, modality inconsistencies, and inefficient cross-modal interactions, which limits both fusion quality and segmentation accuracy. To address this, we propose CMFS, a unified framework that leverages CLIP-guided modality interaction to mitigate noise in multi-modal image fusion and segmentation. Our approach features a region-aware Modal Interaction Alignment module with two components: a VMamba-based encoder with an additional shuffle layer, which yields more robust features, and a CLIP-guided, regionally constrained multi-modal feature interaction block, which emphasizes foreground targets while suppressing low-light noise. In addition, a Frequency-Spatial Collaboration module uses selective scanning and integrates wavelet-, spatial-, and Fourier-domain features to achieve adaptive denoising and balanced feature allocation. Furthermore, we employ a low-rank mixture-of-experts with dynamic routing to improve region-specific fusion and enhance pixel-level accuracy. Extensive experiments on several benchmarks show that the proposed approach is effective relative to state-of-the-art methods in both image fusion quality and semantic segmentation accuracy, especially in complex environments. The source code will be released at IJCAI2025-CMFS.
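The abstract does not specify how the low-rank mixture-of-experts or its dynamic routing is built, so the following is only a minimal, generic sketch of the idea and not the authors' implementation. It assumes each pixel is represented by a feature vector, that each expert is a rank-`r` residual update, and that routing is a softmax over the top-`k` gate scores. All class names, parameters, and initialization choices here are hypothetical.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LowRankExpert:
    """Hypothetical expert: x -> up(down(x)), with rank much smaller than dim."""

    def __init__(self, dim, rank, rng):
        self.down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(rank)]
        self.up = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


class LowRankMoE:
    """Hypothetical top-k routed mixture of low-rank experts (assumed design)."""

    def __init__(self, dim, rank, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.experts = [LowRankExpert(dim, rank, rng) for _ in range(num_experts)]
        # Linear gate: one score per expert for each input feature vector.
        self.gate = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.gate, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        weights = softmax([logits[i] for i in chosen])
        return list(zip(chosen, weights))

    def __call__(self, x):
        out = list(x)  # residual connection: start from the input features
        for index, weight in self.route(x):
            update = self.experts[index](x)
            out = [o + weight * u for o, u in zip(out, update)]
        return out
```

In this sketch, routing is "dynamic" only in the sense that the selected experts and their weights depend on each input vector; applying `LowRankMoE` independently to every pixel's feature vector would give a region-dependent fusion. How CMFS actually conditions the routing is not stated in the abstract.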

Cite

Text

Su et al. "CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/209

Markdown

[Su et al. "CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/su2025ijcai-cmfs/) doi:10.24963/IJCAI.2025/209

BibTeX

@inproceedings{su2025ijcai-cmfs,
  title     = {{CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation}},
  author    = {Su, Guilin and Huang, Yuqing and Yang, Chao and He, Zhenyu},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {1873--1881},
  doi       = {10.24963/IJCAI.2025/209},
  url       = {https://mlanthology.org/ijcai/2025/su2025ijcai-cmfs/}
}