Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing

Xu, Yating; Hu, Conghui; Lee, Gim Hee

WACV 2024 pp. 5615-5624

Abstract

Existing works on weakly-supervised audio-visual video parsing adopt the hybrid attention network (HAN) as the multi-modal embedding to capture cross-modal context. HAN embeds the audio and visual modalities with a shared network in which cross-attention is performed at the input. However, such early fusion heavily entangles two modalities that are not fully correlated, which leads to sub-optimal performance in detecting single-modality events. To address this problem, we propose the messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion. The messengers condense the full cross-modal context into a compact representation that preserves only the useful cross-modal information. Furthermore, because microphones capture audio events from all directions while cameras record visual events only within a restricted field of view, unaligned cross-modal context from the audio stream occurs more frequently for visual event predictions. We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction. Experiments show that our framework consistently outperforms existing state-of-the-art methods.
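The two ideas in the abstract can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the abstract does not specify layer structure, projections, or the form of the consistency loss. The function names (`attend`, `messenger_fusion`, `cross_audio_consistency`), the projection-free single-head attention, and the mean-squared-difference penalty are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections
    (a simplification; a real transformer layer would project Q, K and V)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values


def messenger_fusion(target, source, messengers):
    """Fuse `source` into `target` through a small set of messenger tokens.

    target:     (T, d) tokens of the modality being predicted (e.g. visual)
    source:     (T, d) tokens of the other modality (e.g. audio)
    messengers: (M, d) messenger tokens with M much smaller than T (assumed learnable)
    """
    # Step 1: messengers condense the full source sequence into M tokens.
    condensed = attend(messengers, source)
    # Step 2: target tokens attend to themselves plus the condensed messengers,
    # so they never see the full, possibly uncorrelated, source sequence.
    context = np.concatenate([target, condensed], axis=0)
    return attend(target, context)


def cross_audio_consistency(visual, audio, other_audio, messengers, classifier):
    """Assumed penalty: visual event probabilities should change little when
    the paired audio is swapped for audio from a different video."""
    p_own = sigmoid(messenger_fusion(visual, audio, messengers) @ classifier)
    p_other = sigmoid(messenger_fusion(visual, other_audio, messengers) @ classifier)
    return float(np.mean((p_own - p_other) ** 2))
```

In this sketch the bottleneck is the number of messengers `M`: the visual stream can only receive as much audio context as fits through those `M` tokens, which is one plausible reading of "condense the full cross-modal context into a compact representation".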

Cite

Text

Xu et al. "Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing." Winter Conference on Applications of Computer Vision, 2024.

Markdown

[Xu et al. "Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/xu2024wacv-rethink/)

BibTeX

@inproceedings{xu2024wacv-rethink,
  title     = {{Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing}},
  author    = {Xu, Yating and Hu, Conghui and Lee, Gim Hee},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2024},
  pages     = {5615-5624},
  url       = {https://mlanthology.org/wacv/2024/xu2024wacv-rethink/}
}