Discriminative Sounding Objects Localization via Self-Supervised Audiovisual Matching

Abstract

Discriminatively localizing sounding objects in cocktail-party scenarios, i.e., scenes with mixed sound sources, comes naturally to humans but remains challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we learn robust object representations by aggregating candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in the cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are accordingly selected by matching audio and visual object category distributions, where audiovisual consistency serves as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the locations of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
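The second stage hinges on matching the audio-derived category distribution against the distribution read off the class-aware visual localization maps. The snippet below is a minimal sketch of that idea, not the authors' implementation: the pooling scheme, tensor shapes, function names, and the choice of a KL-divergence consistency loss are all illustrative assumptions.

```python
# Hypothetical sketch of audiovisual category-distribution matching.
# All shapes, names, and the KL loss are assumptions for illustration.
import torch
import torch.nn.functional as F

def category_distribution_from_maps(localization_maps):
    """Pool class-aware localization maps (B, K, H, W) into a visual
    object-category distribution of shape (B, K)."""
    logits = localization_maps.flatten(2).max(dim=2).values  # per-class evidence
    return F.softmax(logits, dim=1)

def audiovisual_matching_loss(audio_logits, localization_maps):
    """Self-supervised consistency loss: the audio-derived category
    distribution should agree with the distribution implied by the
    visual localization maps of the same cocktail-party clip."""
    audio_dist = F.softmax(audio_logits, dim=1)               # (B, K)
    visual_dist = category_distribution_from_maps(localization_maps)
    # KL(audio || visual); any distribution-matching loss could stand in here.
    return F.kl_div(visual_dist.clamp_min(1e-8).log(), audio_dist,
                    reduction="batchmean")

if __name__ == "__main__":
    # Random tensors stand in for the audio classifier and localization network.
    B, K, H, W = 2, 10, 14, 14
    audio_logits = torch.randn(B, K)
    loc_maps = torch.rand(B, K, H, W)
    print(audiovisual_matching_loss(audio_logits, loc_maps).item())
```

In this sketch, silent objects are implicitly suppressed because classes that the audio distribution assigns little mass to contribute little to the matching objective; the paper's actual selection mechanism should be taken from the linked repository.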

Cite

Text

Hu et al. "Discriminative Sounding Objects Localization via Self-Supervised Audiovisual Matching." Neural Information Processing Systems, 2020.

Markdown

[Hu et al. "Discriminative Sounding Objects Localization via Self-Supervised Audiovisual Matching." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/hu2020neurips-discriminative/)

BibTeX

@inproceedings{hu2020neurips-discriminative,
  title     = {{Discriminative Sounding Objects Localization via Self-Supervised Audiovisual Matching}},
  author    = {Hu, Di and Qian, Rui and Jiang, Minyue and Tan, Xiao and Wen, Shilei and Ding, Errui and Lin, Weiyao and Dou, Dejing},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/hu2020neurips-discriminative/}
}