AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation

Zhu, Yuxin; Sun, Yinan; Duan, Huiyu; Cao, Yuqin; Jia, Ziheng; Hu, Qiang; Min, Xiongkuo; Zhai, Guangtao

doi:10.1007/978-3-031-91856-8_8

ECCVW 2024 pp. 127-143

Abstract

Visual saliency prediction is critical for understanding human attention in video content and supports various applications. In this paper, we introduce AVSal, an audio-visual saliency prediction model designed to improve the accuracy of video saliency prediction. AVSal leverages foundation models, specifically CLIP and ImageBind, for robust and high-quality feature extraction from both visual and auditory inputs. A novel cross-attention-based fusion mechanism then integrates audio and visual features at multiple levels, capturing the intricate relationships between these modalities. Additionally, a spatio-temporal GRU architecture preserves critical temporal dynamics, improving the model’s accuracy in predicting saliency in dynamic scenes. Extensive experimental results demonstrate that the proposed AVSal model performs excellently in the ECCV AIM Video Saliency Prediction Challenge 2024 and significantly outperforms other state-of-the-art models on six other mainstream audio-visual saliency datasets.
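As a rough illustration of the cross-attention fusion idea mentioned in the abstract, the sketch below lets each visual token attend over a set of audio tokens with scaled dot-product attention, then adds the result back to the visual token. It is not the authors' implementation: the function names (`cross_attention`, `fuse`), the residual addition, the toy dimensions, and the absence of learned projections are all assumptions made for brevity, and random vectors stand in for CLIP and ImageBind features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of d-dim vectors (here: visual tokens)
    keys, values: lists of d-dim vectors (here: audio tokens)
    Returns one d-dim context vector per query.
    """
    d = len(queries[0])
    contexts = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
        contexts.append(context)
    return contexts


def fuse(visual_tokens, audio_tokens):
    """Hypothetical residual fusion: add the attended audio context to each visual token."""
    attended = cross_attention(visual_tokens, audio_tokens, audio_tokens)
    return [[v + a for v, a in zip(vis, att)] for vis, att in zip(visual_tokens, attended)]


# Toy stand-ins for extracted features (assumed shapes, not from the paper).
random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 visual tokens
audio = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 audio tokens
fused = fuse(visual, audio)
print(len(fused), len(fused[0]))
```

The fused output keeps the shape of the visual tokens, so in a multi-level design a step like this could be repeated at each feature scale before any temporal aggregation.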

Cite

Text

Zhu et al. "AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91856-8_8

Markdown

[Zhu et al. "AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/zhu2024eccvw-avsal/) doi:10.1007/978-3-031-91856-8_8

BibTeX

@inproceedings{zhu2024eccvw-avsal,
  title     = {{AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation}},
  author    = {Zhu, Yuxin and Sun, Yinan and Duan, Huiyu and Cao, Yuqin and Jia, Ziheng and Hu, Qiang and Min, Xiongkuo and Zhai, Guangtao},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {127-143},
  doi       = {10.1007/978-3-031-91856-8_8},
  url       = {https://mlanthology.org/eccvw/2024/zhu2024eccvw-avsal/}
}