AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation

Abstract

Visual saliency prediction is critical for understanding human attention in video content and supports various applications. In this paper, we introduce AVSal, an audio-visual saliency prediction model designed to improve the accuracy of video saliency prediction. AVSal leverages foundation models, specifically CLIP and ImageBind, to extract robust, high-quality features from both visual and auditory inputs. A novel cross-attention-based fusion mechanism then integrates audio and visual features at multiple levels, capturing the intricate relationships between these modalities. Additionally, a spatio-temporal GRU architecture preserves critical temporal dynamics, improving the model’s accuracy in predicting saliency in dynamic scenes. Extensive experimental results demonstrate that the proposed AVSal model performs strongly in the ECCV AIM Video Saliency Prediction Challenge 2024 and significantly outperforms other state-of-the-art models on six other mainstream audio-visual saliency datasets.
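The two mechanisms named in the abstract, cross-attention fusion and recurrent temporal aggregation, can be illustrated with a minimal NumPy sketch. This is not the AVSal implementation: random arrays stand in for the CLIP and ImageBind features, fusion happens at a single level rather than multiple levels, the GRU is applied per visual token rather than as a spatio-temporal (e.g. convolutional) GRU, and all names, shapes, and weights (`cross_attention_fuse`, `gru_step`, `d`, `n_v`, `n_a`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attention_fuse(visual, audio, w_q, w_k, w_v):
    """Visual tokens (n_v, d) attend over audio tokens (n_a, d).

    Returns the fused visual tokens (n_v, d) and the attention
    weights (n_v, n_a). The residual connection is an assumption.
    """
    q = visual @ w_q                       # queries from the visual stream
    k = audio @ w_k                        # keys from the audio stream
    v = audio @ w_v                        # values from the audio stream
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return visual + weights @ v, weights


def gru_step(x, h, p):
    """One standard GRU update, applied independently to each token."""
    z = sigmoid(x @ p["w_z"] + h @ p["u_z"])             # update gate
    r = sigmoid(x @ p["w_r"] + h @ p["u_r"])             # reset gate
    h_cand = np.tanh(x @ p["w_h"] + (r * h) @ p["u_h"])  # candidate state
    return (1.0 - z) * h + z * h_cand


rng = np.random.default_rng(0)
d, n_v, n_a, n_frames = 8, 16, 4, 5        # hypothetical sizes

# Randomly initialised weights; a trained model would learn these.
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
gru_params = {
    name: rng.normal(scale=d ** -0.5, size=(d, d))
    for name in ("w_z", "u_z", "w_r", "u_r", "w_h", "u_h")
}

h = np.zeros((n_v, d))                     # temporal state per visual token
for _ in range(n_frames):
    visual = rng.normal(size=(n_v, d))     # stand-in for per-frame visual features
    audio = rng.normal(size=(n_a, d))      # stand-in for aligned audio features
    fused, weights = cross_attention_fuse(visual, audio, w_q, w_k, w_v)
    h = gru_step(fused, h, gru_params)     # aggregate fused features over time
```

After the loop, `h` holds one temporally aggregated feature vector per visual token. A saliency model would decode such features into a per-pixel map, a step omitted here.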

Cite

Text

Zhu et al. "AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91856-8_8

Markdown

[Zhu et al. "AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/zhu2024eccvw-avsal/) doi:10.1007/978-3-031-91856-8_8

BibTeX

@inproceedings{zhu2024eccvw-avsal,
  title     = {{AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation}},
  author    = {Zhu, Yuxin and Sun, Yinan and Duan, Huiyu and Cao, Yuqin and Jia, Ziheng and Hu, Qiang and Min, Xiongkuo and Zhai, Guangtao},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {127--143},
  doi       = {10.1007/978-3-031-91856-8_8},
  url       = {https://mlanthology.org/eccvw/2024/zhu2024eccvw-avsal/}
}