Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning

Abstract

Multi-Modal Object Detection (MMOD) has been widely applied because of its stronger adaptability to complex environments. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, this research neglects the problem of insufficient mono-modality learning, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of MMOD models. Motivated by this, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. On this basis, we construct a novel framework called M2D-LIF, which consists of the Mono-Modality Distillation (M2D) method and the Local Illumination-aware Fusion (LIF) module. The M2D-LIF framework facilitates sufficient mono-modality learning during multi-modal joint training and explores a lightweight yet effective feature fusion approach to achieve superior object detection performance. Extensive experiments on three MMOD datasets demonstrate that M2D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms previous SOTA detectors. The code is available at https://github.com/Zhao-Tian-yi/M2D-LIF.

Cite

Text

Zhao et al. "Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning." International Conference on Computer Vision, 2025.

Markdown

[Zhao et al. "Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhao2025iccv-rethinking/)

BibTeX

@inproceedings{zhao2025iccv-rethinking,
  title     = {{Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning}},
  author    = {Zhao, Tianyi and Liu, Boyang and Gao, Yanglei and Sun, Yiming and Yuan, Maoxun and Wei, Xingxing},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {6364-6373},
  url       = {https://mlanthology.org/iccv/2025/zhao2025iccv-rethinking/}
}