Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

Bao, Peijun; Yang, Wenhan; Ng, Boon Poh; Er, Meng Hwa; Kot, Alex C.

doi:10.1609/AAAI.V37I1.25093

AAAI 2023 pp. 215-222

Abstract

This paper is the first to explore audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modal datasets with category annotations is labor-intensive and thus does not scale to real-world applications. To this end, we propose cross-modal label contrastive learning, which exploits multi-modal information in unlabeled audio and visual streams as a self-supervision signal. At the feature representation level, multi-modal representations are collaboratively learned from the audio and visual components using self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, label contrasting, to self-annotate videos with pseudo-labels for training the localization model. Irrelevant background content can hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model. To address this issue, we propose an expectation-maximization algorithm that optimizes pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to state-of-the-art supervised methods.

Cite

Text

Bao et al. "Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I1.25093

Markdown

[Bao et al. "Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/bao2023aaai-cross/) doi:10.1609/AAAI.V37I1.25093

BibTeX

@inproceedings{bao2023aaai-cross,
  title     = {{Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization}},
  author    = {Bao, Peijun and Yang, Wenhan and Ng, Boon Poh and Er, Meng Hwa and Kot, Alex C.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {215--222},
  doi       = {10.1609/AAAI.V37I1.25093},
  url       = {https://mlanthology.org/aaai/2023/bao2023aaai-cross/}
}