AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation
Abstract
Visual saliency prediction is critical for understanding human attention in video content and supports various applications. In this paper, we introduce AVSal, an audio-visual saliency prediction model designed to improve the accuracy of video saliency prediction. AVSal leverages foundation models, specifically CLIP and ImageBind, for robust, high-quality feature extraction from both visual and auditory inputs. A novel cross-attention-based fusion mechanism then integrates audio and visual features at multiple levels, capturing the intricate relationships between these modalities. Additionally, a spatio-temporal GRU architecture preserves critical temporal dynamics, improving the model’s accuracy in predicting saliency in dynamic scenes. Extensive experimental results demonstrate that the proposed AVSal model performs excellently in the ECCV AIM Video Saliency Prediction Challenge 2024 and significantly outperforms other state-of-the-art models on six other mainstream audio-visual saliency datasets.
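To illustrate the general idea of cross-attention-based audio-visual fusion mentioned in the abstract, here is a minimal NumPy sketch in which visual tokens act as queries over audio tokens. This is not the authors' implementation: the function name `cross_attention_fuse`, the single-head design, the residual connection, the feature dimension, and the random projection matrices are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(visual, audio, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention fusion.

    visual: (Nv, D) visual tokens, used as queries.
    audio:  (Na, D) audio tokens, used as keys and values.
    Returns the fused (Nv, D) features and the (Nv, Na) attention weights.
    """
    q = visual @ w_q                          # queries from the visual stream
    k = audio @ w_k                           # keys from the audio stream
    v = audio @ w_v                           # values from the audio stream
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each visual token attends over audio
    fused = visual + weights @ v              # residual connection (an assumption)
    return fused, weights

# Toy inputs: 16 visual tokens and 4 audio tokens with 8-dim features (arbitrary sizes).
rng = np.random.default_rng(0)
d = 8
visual = rng.standard_normal((16, d))
audio = rng.standard_normal((4, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention_fuse(visual, audio, w_q, w_k, w_v)
```

In a multi-level setup such as the one the abstract describes, a block like this would presumably be applied separately at each feature scale, with learned rather than random projections.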
Cite
Text
Zhu et al. "AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91856-8_8
Markdown
[Zhu et al. "AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/zhu2024eccvw-avsal/) doi:10.1007/978-3-031-91856-8_8
BibTeX
@inproceedings{zhu2024eccvw-avsal,
title = {{AVSal: Enhancing Video Saliency Prediction Through Audio-Visual Fusion and Temporal Aggregation}},
author = {Zhu, Yuxin and Sun, Yinan and Duan, Huiyu and Cao, Yuqin and Jia, Ziheng and Hu, Qiang and Min, Xiongkuo and Zhai, Guangtao},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {127-143},
doi = {10.1007/978-3-031-91856-8_8},
url = {https://mlanthology.org/eccvw/2024/zhu2024eccvw-avsal/}
}