Audio-Visual Adaptive Fusion Network for Question Answering Based on Contrastive Learning

Abstract

The Audio-Visual Question Answering (AVQA) task requires extracting question-related audio-visual clues from both temporal and spatial perspectives to answer questions accurately. Although existing multi-modal AVQA models perform well thanks to large-scale pre-trained models, two challenges remain. First, aligning audio-visual information across temporal and spatial dimensions is difficult. Second, audio and visual information is often weighted inadequately during fusion, which limits model performance. To address these issues, we design the Audio-Visual Adaptive Fusion Network (AVAF-Net), which uses contrastive learning to align audio-visual information temporally and spatially and adaptively adjusts fusion weights based on the question. Specifically, we first align visual and audio information temporally through a temporal-alignment contrastive loss. An audio-visual clue-mining module then highlights question-related cues and aligns them spatially with the vocal region using a spatial-alignment contrastive loss. Additionally, a question-oriented adaptive fusion module assigns different weights to the audio and visual modalities based on the question content and then fuses them. The fused audio-visual cues are finally used to predict the answer. Extensive experiments on the MUSIC-AVQA dataset show that AVAF-Net surpasses all baseline models, with a maximum improvement of 15.90% in average accuracy and an average improvement of 9.80%.
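The abstract names a temporal-alignment contrastive loss and a question-oriented adaptive fusion module but does not give their exact form. The following is only a minimal sketch of the two ideas under stated assumptions: the contrastive loss is taken to be a symmetric InfoNCE loss in which the audio and visual features of the same time segment form the positive pair, and the fusion weights are taken to be a softmax over the question's similarity to each modality. The function names, the temperature value, and the dot-product scoring are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero for all-zero vectors
    return [x / norm for x in v]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_alignment_loss(audio, visual, temperature=0.1):
    """Symmetric InfoNCE over T segments (assumed form, not the paper's).

    audio[t] and visual[t] are treated as the positive pair; every other
    segment of the same clip serves as a negative.
    """
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in visual]
    num_segments = len(a)
    loss = 0.0
    for t in range(num_segments):
        a2v = softmax([dot(a[t], v[s]) / temperature for s in range(num_segments)])
        v2a = softmax([dot(v[t], a[s]) / temperature for s in range(num_segments)])
        loss -= 0.5 * (math.log(a2v[t]) + math.log(v2a[t]))
    return loss / num_segments


def question_adaptive_fusion(audio_feat, visual_feat, question_feat):
    """Fuse two modality vectors with question-dependent weights (assumed gating).

    The weights are a softmax over the question's dot-product similarity to
    each modality, so they are positive and sum to one.
    """
    w_a, w_v = softmax([dot(question_feat, audio_feat),
                        dot(question_feat, visual_feat)])
    fused = [w_a * x + w_v * y for x, y in zip(audio_feat, visual_feat)]
    return fused, (w_a, w_v)


# Toy two-segment example: aligned streams should give a lower loss than swapped ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[1.0, 0.0], [0.0, 1.0]]
swapped_visual = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = temporal_alignment_loss(audio, aligned_visual)
swapped_loss = temporal_alignment_loss(audio, swapped_visual)

# A question vector that resembles the audio feature puts more weight on audio.
fused, (w_audio, w_visual) = question_adaptive_fusion([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy setting the loss is small when segment t of the audio matches segment t of the video and large when the segments are swapped, which is the behaviour a temporal-alignment objective is meant to reward. A trained model would learn the question-to-modality scoring rather than use a raw dot product.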

Cite

Text

Zhao et al. "Audio-Visual Adaptive Fusion Network for Question Answering Based on Contrastive Learning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33138

Markdown

[Zhao et al. "Audio-Visual Adaptive Fusion Network for Question Answering Based on Contrastive Learning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhao2025aaai-audio/) doi:10.1609/AAAI.V39I10.33138

BibTeX

@inproceedings{zhao2025aaai-audio,
  title     = {{Audio-Visual Adaptive Fusion Network for Question Answering Based on Contrastive Learning}},
  author    = {Zhao, Xujian and Wang, Yixin and Jin, Peiquan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10483-10491},
  doi       = {10.1609/AAAI.V39I10.33138},
  url       = {https://mlanthology.org/aaai/2025/zhao2025aaai-audio/}
}