Learning to Separate Object Sounds by Watching Unlabeled Video

Abstract

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to study audio source separation in large-scale general "in the wild" videos. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising.
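
The abstract describes recovering per-object audio frequency bases and then using them to guide source separation on new mixtures. As a rough, hedged illustration only (not the paper's exact pipeline), the sketch below assumes non-negative per-object bases have already been recovered, holds them fixed, estimates activations on a mixture magnitude spectrogram with standard KL-divergence NMF multiplicative updates, and reconstructs each object's track with a soft mask. The function name, the NumPy-only implementation, and the hyperparameters are illustrative assumptions.

import numpy as np

def separate_with_fixed_bases(V, W_list, n_iter=200, eps=1e-8):
    """Supervised-NMF separation sketch (an assumption, not the authors' code).

    V      : (freq, time) magnitude spectrogram of the mixture.
    W_list : list of per-object basis matrices, each (freq, k_i), e.g. the
             disentangled bases recovered for each visual object.
    Returns one estimated (freq, time) spectrogram per object.
    """
    W = np.concatenate(W_list, axis=1)                 # (freq, K_total), held fixed
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps     # (K_total, time) activations

    # Multiplicative updates for H only (KL-divergence NMF), bases W fixed.
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)

    # Reconstruct each object's spectrogram with a soft (Wiener-like) mask.
    WH = W @ H + eps
    sources, start = [], 0
    for Wi in W_list:
        k = Wi.shape[1]
        Vi = Wi @ H[start:start + k]
        sources.append(V * (Vi / WH))                  # mask applied to the mixture
        start += k
    return sources

For example, with hypothetical learned bases W_guitar and W_violin and a mixture spectrogram V_mix, one would call separate_with_fixed_bases(V_mix, [W_guitar, W_violin]) and invert each masked spectrogram (e.g. using the mixture phase) to obtain the separated object-level waveforms.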

Cite

Text

Gao et al. "Learning to Separate Object Sounds by Watching Unlabeled Video." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.

Markdown

[Gao et al. "Learning to Separate Object Sounds by Watching Unlabeled Video." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.](https://mlanthology.org/cvprw/2018/gao2018cvprw-learning/)

BibTeX

@inproceedings{gao2018cvprw-learning,
  title     = {{Learning to Separate Object Sounds by Watching Unlabeled Video}},
  author    = {Gao, Ruohan and Feris, Rogério Schmidt and Grauman, Kristen},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2018},
  pages     = {2496--2499},
  url       = {https://mlanthology.org/cvprw/2018/gao2018cvprw-learning/}
}