DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection
Abstract
Multi-modal salient object detection (SOD), which integrates additional data such as depth or thermal information, has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (Depth), and RGB-T (Thermal) images are tackled separately. However, without intricate cross-modal fusion strategies, such approaches struggle to effectively integrate multi-modal information, often resulting in poorly defined object edges or overconfident yet inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three types of SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task using diffusion models. We introduce DiMSOD, which enables the concurrent use of local controls (depth maps, thermal maps) and global controls (original images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient: it only requires fine-tuning our newly introduced modules on top of an existing Stable Diffusion model. This reduces the fine-tuning cost, making the approach more viable for practical use, and also enhances the integration of multi-modal conditional controls. Specifically, we have developed three modules, SOD-ControlNet, the Feature Adaptive Network (FAN), and the Feature Injection Attention Network (FIAN), to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous well-established methods.
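The idea of treating SOD as conditional mask generation with progressive denoising can be illustrated with a minimal sketch. The code below is not the DiMSOD implementation: it is a generic pixel-space DDPM-style reverse sampling loop in NumPy, whereas the paper builds on Stable Diffusion and its own SOD-ControlNet, FAN, and FIAN modules, none of which are modeled here. The names `make_schedule`, `toy_noise_predictor`, and `sample_mask` are hypothetical, and the noise predictor is a hand-written stand-in for a trained network that simply steers the sample toward a blend of the conditions so that the loop can run without any trained weights.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the derived DDPM quantities."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(noisy_mask, rgb, aux, alpha_bar):
    """Hypothetical stand-in for a trained conditional denoiser.

    It guesses a clean mask from the conditions (global: RGB image,
    local: optional depth or thermal map) and returns the noise that
    this guess implies for the current noisy sample.
    """
    guide = rgb.mean(axis=-1)           # global control, values in [0, 1]
    if aux is not None:
        guide = 0.5 * (guide + aux)     # blend in the local control
    target = 2.0 * guide - 1.0          # rescale to [-1, 1]
    return (noisy_mask - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)


def sample_mask(rgb, aux=None, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step into a mask."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(rgb.shape[:2])
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, rgb, aux, alpha_bars[t])
        # Standard DDPM posterior mean for x_{t-1} given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)   # back to [0, 1]


# Toy usage: a bright square in both the RGB image and the depth map.
rgb = np.zeros((8, 8, 3))
rgb[2:6, 2:6] = 1.0
depth = np.zeros((8, 8))
depth[2:6, 2:6] = 1.0
mask = sample_mask(rgb, depth)
```

Passing `aux=None` corresponds to the RGB-only case, while a depth or thermal map plays the role of the local control. In a real system the toy predictor would be replaced by a learned network conditioned on both inputs.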
Cite
Text
Zhang et al. "DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33096
Markdown
[Zhang et al. "DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhang2025aaai-dimsod/) doi:10.1609/AAAI.V39I10.33096
BibTeX
@inproceedings{zhang2025aaai-dimsod,
title = {{DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection}},
author = {Zhang, Shuo and Huang, Jiaming and Tang, Wenbing and Wu, Yan and Hu, Terrence and Xu, Xiaogang and Liu, Jing},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {10103-10111},
doi = {10.1609/AAAI.V39I10.33096},
url = {https://mlanthology.org/aaai/2025/zhang2025aaai-dimsod/}
}