AVE-CLIP: AudioCLIP-Based Multi-Window Temporal Transformer for Audio Visual Event Localization

Mahmud, Tanvir; Marculescu, Diana

WACV 2023 pp. 5158-5167

Abstract

An audio-visual event (AVE) is defined by the correspondence of the visual and auditory signals in a video segment. Precise localization of AVEs is challenging because it demands effective multi-modal feature correspondence to ground short- and long-range temporal interactions. Existing approaches struggle to capture the different scales of multi-modal interaction due to ineffective multi-modal training strategies. To overcome this limitation, we introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer to operate effectively on different temporal scales of video frames. Our contributions are three-fold: (1) We introduce a multi-stage training framework to incorporate AudioCLIP, pre-trained with audio-image pairs, into the AVE localization task on video frames through contrastive fine-tuning, effective mean video feature extraction, and multi-scale training phases. (2) We propose a multi-domain attention mechanism that operates on both temporal and feature domains over varying timescales to fuse the local and global feature variations. (3) We introduce a temporal refining scheme with event-guided attention, followed by a simple yet effective post-processing step, to handle significant variations of the background over diverse events. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement, demonstrating its advantage over existing approaches.
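The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of attending over several temporal window sizes. It is not the authors' architecture. The function names, the use of non-overlapping windows, the weight-free dot-product attention, and the averaging across scales are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(window):
    """Dot-product self-attention over a list of feature vectors.

    Illustrative only: no learned query/key/value projections are used.
    """
    dim = len(window[0])
    attended = []
    for query in window:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in window
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, window)) for d in range(dim)]
        )
    return attended


def multi_window_attention(features, window_sizes):
    """Attend within non-overlapping windows at each size, then average the scales.

    Both the windowing scheme and the averaging are hypothetical choices,
    not taken from the paper.
    """
    num_steps, dim = len(features), len(features[0])
    fused = [[0.0] * dim for _ in range(num_steps)]
    for size in window_sizes:
        for start in range(0, num_steps, size):
            window = features[start:start + size]
            for offset, vector in enumerate(self_attention(window)):
                for d in range(dim):
                    fused[start + offset][d] += vector[d] / len(window_sizes)
    return fused


# Hypothetical input: 10 temporal segments with 2-dimensional features.
features = [[float(t), 1.0] for t in range(10)]
fused = multi_window_attention(features, window_sizes=[2, 5, 10])
```

Small windows restrict each segment to its immediate neighbours, while the largest window lets every segment attend to the whole sequence, which is one simple way to combine local and global temporal context.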

Cite

Text

Mahmud and Marculescu. "AVE-CLIP: AudioCLIP-Based Multi-Window Temporal Transformer for Audio Visual Event Localization." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Mahmud and Marculescu. "AVE-CLIP: AudioCLIP-Based Multi-Window Temporal Transformer for Audio Visual Event Localization." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/mahmud2023wacv-aveclip/)

BibTeX

@inproceedings{mahmud2023wacv-aveclip,
  title     = {{AVE-CLIP: AudioCLIP-Based Multi-Window Temporal Transformer for Audio Visual Event Localization}},
  author    = {Mahmud, Tanvir and Marculescu, Diana},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {5158-5167},
  url       = {https://mlanthology.org/wacv/2023/mahmud2023wacv-aveclip/}
}