Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Abstract

The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories. However, temporally annotating audio and visual events is labor intensive, which hampers the learning of a parsing model. To this end, we propose to explore additional cross-video and cross-modality supervisory signals to facilitate weakly-supervised audio-visual video parsing. The proposed method exploits both the common and diverse event semantics across videos to identify audio or visual events. In addition, our method explores event co-occurrence across the audio, visual, and audio-visual streams. We leverage the explored cross-modality co-occurrence to localize segments of target events while excluding irrelevant ones. The supervisory signals discovered across different videos and modalities greatly facilitate training with only video-level annotations. Quantitative and qualitative results demonstrate that the proposed method performs favorably against existing methods on weakly-supervised audio-visual video parsing.
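
The sketch below illustrates the kind of cross-video weak supervision the abstract describes; it is a minimal, illustrative example rather than the authors' implementation. The module names, feature shapes, pooling choice, and the rule for deriving targets from a pair of videos' label sets are all assumptions made for exposition.

# Illustrative sketch only (not the authors' code): names, shapes, and the
# cross-video label rule are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 25   # assumed number of event categories
NUM_SEGMENTS = 10  # assumed number of temporal segments per video


class SegmentClassifier(nn.Module):
    """Per-segment event classifier for a single modality."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, NUM_CLASSES))

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        return torch.sigmoid(self.head(feats))   # segment probs: (B, T, C)


def video_level(segment_probs):
    # Weak supervision: pool segment predictions into a video-level prediction
    # (max pooling here; the paper's aggregation may differ).
    return segment_probs.max(dim=1).values       # (B, C)


# Toy batch of two videos with video-level multi-hot labels only.
audio_net, visual_net = SegmentClassifier(), SegmentClassifier()
audio_feats = torch.randn(2, NUM_SEGMENTS, 512)
visual_feats = torch.randn(2, NUM_SEGMENTS, 512)
labels = torch.randint(0, 2, (2, NUM_CLASSES)).float()

p_audio = video_level(audio_net(audio_feats))
p_visual = video_level(visual_net(visual_feats))

# Base weakly-supervised loss: each modality should explain the video tags.
loss = F.binary_cross_entropy(p_audio, labels) \
     + F.binary_cross_entropy(p_visual, labels)

# Cross-video term (assumed form): pair each video's visual stream with the
# other video's audio stream. Events shared by both videos remain positive
# for the swapped-in audio; events unique to the original video become
# negatives, since the other video does not contain them at video level.
labels_other = labels.flip(0)                    # labels of the swapped-in video
common = labels * labels_other                   # events both videos contain
exclusive = labels * (1.0 - labels_other)        # events only the original has
p_audio_swapped = video_level(audio_net(audio_feats.flip(0)))

mask = common + exclusive                        # supervise only these classes
per_class = F.binary_cross_entropy(p_audio_swapped, common, reduction="none")
loss = loss + (per_class * mask).sum() / mask.sum().clamp(min=1.0)
print(float(loss))

In this toy form, the cross-video term only adds supervision on classes whose status can be inferred from the pair of video-level label sets (shared events as positives, exclusive events as negatives); how the actual method forms and weighs such signals, and how it models audio-visual co-occurrence, is described in the paper.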

Cite

Text

Lin et al. "Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing." Neural Information Processing Systems, 2021.

Markdown

[Lin et al. "Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/lin2021neurips-exploring/)

BibTeX

@inproceedings{lin2021neurips-exploring,
  title     = {{Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing}},
  author    = {Lin, Yan-Bo and Tseng, Hung-Yu and Lee, Hsin-Ying and Lin, Yen-Yu and Yang, Ming-Hsuan},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/lin2021neurips-exploring/}
}