Audio-Visual Segmentation

Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

ECCV 2022

doi:10.1007/978-3-031-19836-6

Abstract

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a new method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.

Cite

Text

Zhou et al. "Audio-Visual Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19836-6

Markdown

[Zhou et al. "Audio-Visual Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/zhou2022eccv-audiovisual/) doi:10.1007/978-3-031-19836-6

BibTeX

@inproceedings{zhou2022eccv-audiovisual,
  title     = {{Audio-Visual Segmentation}},
  author    = {Zhou, Jinxing and Wang, Jianyuan and Zhang, Jiayi and Sun, Weixuan and Zhang, Jing and Birchfield, Stan and Guo, Dan and Kong, Lingpeng and Wang, Meng and Zhong, Yiran},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19836-6},
  url       = {https://mlanthology.org/eccv/2022/zhou2022eccv-audiovisual/}
}