CAD - Contextual Multi-Modal Alignment for Dynamic AVQA

Abstract

In Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities can be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned on the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context. Both result in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses these challenges by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on the Temporal level in a self-supervised setting; and iii) introducing a cross-attention mechanism to balance audio and visual information on the Semantic level. The proposed CAD network improves overall performance over state-of-the-art methods by 9.4% on average on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions can be added to existing AVQA methods to improve their performance without additional complexity requirements.
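
For readers unfamiliar with the term, the following is a minimal sketch of generic cross-attention between an audio and a visual feature sequence. It is not the CAD network's implementation: the abstract does not specify the architecture, so the function names, the feature shapes (10 segments of 64 dimensions), and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention where rows of `queries` attend over rows of `context`.

    Simplification (assumption): no learned projections, so the context rows
    serve as both keys and values.
    """
    dim = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(dim)  # (n_queries, n_context)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ context, weights


# Hypothetical features: 10 temporal segments, 64 dimensions per modality.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 64))
visual = rng.standard_normal((10, 64))

# Audio segments query the visual sequence, and vice versa.
audio_attended, audio_to_visual = cross_attention(audio, visual)
visual_attended, visual_to_audio = cross_attention(visual, audio)
```

Running attention in both directions gives each modality a view of the other, which is one plausible reading of "balancing" audio and visual information; how the paper actually combines the two streams is not stated in the abstract.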

Cite

Text

Nadeem et al. "CAD - Contextual Multi-Modal Alignment for Dynamic AVQA." Winter Conference on Applications of Computer Vision, 2024.

Markdown

[Nadeem et al. "CAD - Contextual Multi-Modal Alignment for Dynamic AVQA." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/nadeem2024wacv-cad/)

BibTeX

@inproceedings{nadeem2024wacv-cad,
  title     = {{CAD - Contextual Multi-Modal Alignment for Dynamic AVQA}},
  author    = {Nadeem, Asmar and Hilton, Adrian and Dawes, Robert and Thomas, Graham and Mustafa, Armin},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2024},
  pages     = {7251-7263},
  url       = {https://mlanthology.org/wacv/2024/nadeem2024wacv-cad/}
}