Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Lai, Yung-Hsuan; Chen, Yen-Chun; Wang, Frank

Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Yung-Hsuan Lai, Yen-Chun Chen, Frank Wang

NeurIPS 2023

/neurips/2023/lai2023neurips-modalityindependent/

Abstract

Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its $\textit{modality-aligned}$ setting, $\textit{i.e.}$, the audio and visual modality are $\textit{both}$ assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored $\textit{unaligned}$ setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell what events happen without knowing the modality they are perceived (audio, visual, or both).To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed $\textbf{V}$isual-$\textbf{A}$udio $\textbf{L}$abel Elab$\textbf{or}$ation (VALOR), is innovated to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by $\textbf{8.0}$ in average F-score (Type@AV).Surprisingly, we found that modality-independent teachers outperform their modality-fused counterparts since they are noise-proof from the other potentially unaligned modality. Moreover, our best model achieves the new state-of-the-art on all metrics of LLP by a substantial margin ($\textbf{+5.4}$ F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization and achieves the new state-of-the-art as well.

PDF NeurIPS OpenReview Semantic Scholar

Cite

Text

Lai et al. "Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser." Neural Information Processing Systems, 2023.

Markdown

[Lai et al. "Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/lai2023neurips-modalityindependent/)

BibTeX

@inproceedings{lai2023neurips-modalityindependent,
  title     = {{Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser}},
  author    = {Lai, Yung-Hsuan and Chen, Yen-Chun and Wang, Frank},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/lai2023neurips-modalityindependent/}
}