Learning to Separate Object Sounds by Watching Unlabeled Video
Abstract
Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/
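The separation step the abstract describes rests on factoring the mixture spectrogram into non-negative frequency bases and grouping those bases by the visual object they were linked to. A minimal sketch of that idea, using plain NMF with multiplicative updates and a hard-coded base-to-object assignment standing in for the paper's learned multi-instance multi-label association (all array sizes and the two-object split are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic magnitude spectrogram: a mixture whose columns are time
# frames and rows frequency bins (a stand-in for real audio).
F, T, K = 64, 100, 4  # freq bins, time frames, number of NMF bases
true_W = np.abs(rng.normal(size=(F, K)))
true_H = np.abs(rng.normal(size=(K, T)))
V = true_W @ true_H + 1e-6  # mixture spectrogram

# Plain NMF via multiplicative updates (Lee & Seung).
W = np.abs(rng.normal(size=(F, K))) + 1e-3
H = np.abs(rng.normal(size=(K, T))) + 1e-3
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# Suppose bases {0, 1} were associated with visual object A and
# bases {2, 3} with object B (in the paper this assignment is learned
# from video; here it is assumed). Soft-mask the mixture so the two
# per-object spectrograms sum back to the input.
recon = W @ H + 1e-9
mask_A = (W[:, :2] @ H[:2]) / recon
spec_A = mask_A * V
spec_B = (1.0 - mask_A) * V

rel_err = np.linalg.norm(V - recon) / np.linalg.norm(V)
```

Soft masking (rather than directly taking each group's partial reconstruction) guarantees the separated spectrograms are non-negative and add up exactly to the mixture.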
Cite
Text
Gao et al. "Learning to Separate Object Sounds by Watching Unlabeled Video." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01219-9_3
Markdown
[Gao et al. "Learning to Separate Object Sounds by Watching Unlabeled Video." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/gao2018eccv-learning/) doi:10.1007/978-3-030-01219-9_3
BibTeX
@inproceedings{gao2018eccv-learning,
title = {{Learning to Separate Object Sounds by Watching Unlabeled Video}},
author = {Gao, Ruohan and Feris, Rogerio and Grauman, Kristen},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2018},
doi = {10.1007/978-3-030-01219-9_3},
url = {https://mlanthology.org/eccv/2018/gao2018eccv-learning/}
}