Weakly-Supervised Audio-Visual Segmentation
Abstract
Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work relied on comprehensive, manually designed architectures trained with large numbers of pixel-wise accurate masks as supervision. However, such pixel-level masks are expensive to annotate and not available in all cases. In this work, we aim to simplify the supervision to instance-level annotation, $\textit{i.e.}$, weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that learns multi-scale audio-visual alignment through multi-scale multiple-instance contrastive learning. Extensive experiments on AVSBench demonstrate the effectiveness of WS-AVS on weakly-supervised audio-visual segmentation in both single-source and multi-source scenarios.
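To make the core idea concrete, below is a minimal PyTorch sketch of a multi-scale multiple-instance contrastive loss in the spirit of the abstract. It is illustrative only, not the authors' exact formulation: the function name, feature shapes, symmetric InfoNCE form, and temperature are assumptions. The multiple-instance step treats an audio-frame pair as matching if its best-matching pixel matches, hence the spatial max-pooling over each scale's similarity map.

```python
import torch
import torch.nn.functional as F

def mi_contrastive_loss(visual_feats, audio_emb, temperature=0.07):
    """Illustrative multi-scale multiple-instance contrastive loss (assumed form).

    visual_feats: list of per-scale visual feature maps, each (B, C, H_s, W_s)
    audio_emb:    audio embeddings, (B, C)
    Pairs (i, i) are positives; all (i, j) with j != i serve as negatives.
    """
    B = audio_emb.size(0)
    audio = F.normalize(audio_emb, dim=-1)               # (B, C)
    total = 0.0
    for feat in visual_feats:
        _, C, H, W = feat.shape
        v = F.normalize(feat.flatten(2), dim=1)          # (B, C, H*W)
        # Similarity between every audio clip and every pixel of every frame:
        # result is (B_audio, B_visual, H*W).
        sim = torch.einsum('ac,bcn->abn', audio, v)
        # Multiple-instance assumption: a pair matches if its best-matching
        # pixel matches, so max-pool over spatial locations.
        sim = sim.max(dim=-1).values / temperature       # (B, B)
        labels = torch.arange(B, device=sim.device)
        # Symmetric InfoNCE over audio->visual and visual->audio directions.
        total = total + 0.5 * (F.cross_entropy(sim, labels)
                               + F.cross_entropy(sim.t(), labels))
    return total / len(visual_feats)

# Toy usage: two scales of visual features with matching audio embeddings.
feats = [torch.randn(4, 128, 28, 28), torch.randn(4, 128, 14, 14)]
audio = torch.randn(4, 128)
print(mi_contrastive_loss(feats, audio))
```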
Cite
Text
Mo and Raj. "Weakly-Supervised Audio-Visual Segmentation." Neural Information Processing Systems, 2023.
Markdown
[Mo and Raj. "Weakly-Supervised Audio-Visual Segmentation." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/mo2023neurips-weaklysupervised/)
BibTeX
@inproceedings{mo2023neurips-weaklysupervised,
title = {{Weakly-Supervised Audio-Visual Segmentation}},
author = {Mo, Shentong and Raj, Bhiksha},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/mo2023neurips-weaklysupervised/}
}