DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

Xiong, Junwen; Zhang, Peng; You, Tao; Li, Chuanyue; Huang, Wei; Zha, Yufei

doi:10.1109/CVPR52733.2024.02575

CVPR 2024 pp. 27273-27283

Abstract

Audio-visual saliency prediction can draw on complementary information from multiple modalities, but further performance gains are still hindered by customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise in unifying task frameworks, owing to their inherent ability to generalize. Following this motivation, this work proposes a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal), which formulates the prediction problem as a conditional generative task over the saliency map, using the input audio and video as conditions. Based on the spatio-temporal audio-visual features, an additional network, Saliency-UNet, is designed to perform multi-modal attention modulation, progressively refining the noisy map toward the ground-truth saliency map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics.
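To make the "conditional generative task" framing concrete, the following is a minimal sketch of a generic DDPM-style conditional sampling loop. It is not the authors' implementation. The linear noise schedule, the step count, the flat-list representation of the map, and every function name (`make_schedule`, `add_noise`, `toy_denoiser`, `sample`) are assumptions made for illustration. In particular, `toy_denoiser` is a hypothetical stand-in for a learned conditional network such as Saliency-UNet: it simply returns the conditioning signal, whereas a real model would predict the clean map from the noisy map and the audio-visual features.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and cumulative products of 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    return betas, alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    return [a * v + s * rng.gauss(0.0, 1.0) for v in x0]


def toy_denoiser(x_t, t, cond):
    """Hypothetical placeholder for a conditional network.

    It ignores x_t and t and returns the condition as its estimate of the
    clean map; a trained model would use all three inputs.
    """
    return list(cond)


def sample(cond, denoiser=toy_denoiser, num_steps=50, seed=0):
    """Reverse process: start from Gaussian noise and refine step by step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in reversed(range(num_steps)):
        # Estimate of the clean map, clipped to the valid saliency range [0, 1].
        x0_hat = [min(1.0, max(0.0, v)) for v in denoiser(x, t, cond)]
        if t == 0:
            return x0_hat
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        # Mean and standard deviation of the DDPM posterior q(x_{t-1} | x_t, x0).
        c0 = math.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
        ct = math.sqrt(1.0 - betas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
        std = math.sqrt(betas[t] * (1.0 - ab_prev) / (1.0 - ab_t))
        x = [c0 * h + ct * v + std * rng.gauss(0.0, 1.0)
             for h, v in zip(x0_hat, x)]
    return x


# Usage: a 4-value "map" stands in for a flattened saliency map.
cond = [0.1, 0.9, 0.5, 0.0]
refined = sample(cond)
```

Because the placeholder denoiser returns the condition unchanged, `refined` equals `cond` here; the point of the sketch is only the control flow, in which a noisy map is progressively refined under a conditioning signal.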

Cite

Text

Xiong et al. "DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02575

Markdown

[Xiong et al. "DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/xiong2024cvpr-diffsal/) doi:10.1109/CVPR52733.2024.02575

BibTeX

@inproceedings{xiong2024cvpr-diffsal,
  title     = {{DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction}},
  author    = {Xiong, Junwen and Zhang, Peng and You, Tao and Li, Chuanyue and Huang, Wei and Zha, Yufei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {27273-27283},
  doi       = {10.1109/CVPR52733.2024.02575},
  url       = {https://mlanthology.org/cvpr/2024/xiong2024cvpr-diffsal/}
}