DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection

Zhang, Shuo; Huang, Jiaming; Tang, Wenbing; Wu, Yan; Hu, Terrence; Xu, Xiaogang; Liu, Jing

doi:10.1609/AAAI.V39I10.33096

AAAI 2025 pp. 10103-10111

Abstract

Multi-modal salient object detection (SOD), which integrates additional data such as depth or thermal information, has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (Depth), and RGB-T (Thermal) images are tackled separately. However, without intricate cross-modal fusion strategies, such approaches struggle to integrate multi-modal information effectively, often resulting in poorly defined object edges or overconfident, inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three types of SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task using diffusion models. We introduce DiMSOD, which enables the concurrent use of local controls (depth maps, thermal maps) and global controls (original images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient: it requires fine-tuning only our newly introduced modules on top of the existing Stable Diffusion model, which reduces the fine-tuning cost, makes the approach more viable for practical use, and enhances the integration of multi-modal conditional controls. Specifically, we have developed modules including SOD-ControlNet, the Feature Adaptive Network (FAN), and the Feature Injection Attention Network (FIAN) to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous well-established methods.
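
The abstract frames multi-modal SOD as conditional mask generation by progressive denoising. As a rough illustration of that framing only, and not the authors' implementation (which fine-tunes added modules on Stable Diffusion), the sketch below runs a standard DDPM reverse process over a flattened mask, conditioned on a grayscale image and an auxiliary depth or thermal map. Everything named here is an assumption made for illustration: `make_schedule`, `toy_noise_predictor`, `sample_mask`, the linear schedule values, and the hand-written rule that stands in for a trained network.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM schedule: betas, alphas, and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, alpha_bar_t, rgb_gray, aux):
    """Hypothetical stand-in for a trained, condition-aware noise predictor.

    It guesses a clean mask in [-1, 1] from the two conditions (global image,
    local depth/thermal map) and returns the noise implied by
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    """
    eps = []
    for x, g, a in zip(x_t, rgb_gray, aux):
        x0_guess = g + a - 1.0  # both inputs in [0, 1]; bright in both -> +1
        eps.append((x - math.sqrt(alpha_bar_t) * x0_guess)
                   / math.sqrt(1.0 - alpha_bar_t))
    return eps


def sample_mask(rgb_gray, aux, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step into a binary mask."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in aux]
    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, alpha_bars[t], rgb_gray, aux)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on final step
        x = [(xi - coef * ei) / math.sqrt(alphas[t]) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    # Threshold the denoised values in [-1, 1] into a 0/1 saliency mask.
    return [1 if xi > 0.0 else 0 for xi in x]


# Six flattened "pixels": the middle two are bright in both conditions.
gray = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
depth = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
mask = sample_mask(gray, depth)
```

In a real system the hand-written rule would be replaced by a learned network that receives the conditions through dedicated control branches; the sampling loop itself is the generic part.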

Cite

Text

Zhang et al. "DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33096

Markdown

[Zhang et al. "DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhang2025aaai-dimsod/) doi:10.1609/AAAI.V39I10.33096

BibTeX

@inproceedings{zhang2025aaai-dimsod,
  title     = {{DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection}},
  author    = {Zhang, Shuo and Huang, Jiaming and Tang, Wenbing and Wu, Yan and Hu, Terrence and Xu, Xiaogang and Liu, Jing},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10103-10111},
  doi       = {10.1609/AAAI.V39I10.33096},
  url       = {https://mlanthology.org/aaai/2025/zhang2025aaai-dimsod/}
}