Self-Supervised Segmentation and Source Separation on Videos

Abstract

Semantic segmentation of images [11, 3] and sound source separation in audio [8, 4, 1] are two important and popular tasks in the computer vision and computational audition communities. Traditional approaches have relied on large, labeled datasets, but recent work has leveraged the natural correspondence between vision and sound as a supervisory signal, learning without explicit labels. In this paper, we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model extends recently proposed work that maps image pixels to sounds [9]. This paper is a workshop version of Rouditchenko et al. 2019 [5].
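
The paper itself carries the details, but the self-supervision idea it builds on [9] can be summarized as mix-and-separate: mix the audio tracks of two videos and train a network to recover each track's spectrogram mask conditioned on the corresponding video frame, so the known components of the synthetic mixture act as free labels. Below is a minimal, hypothetical PyTorch sketch of that objective; the module names, network sizes, and tensor shapes are illustrative stand-ins, not the architecture from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualSeparator(nn.Module):
    """Toy audio-visual separator (a stand-in for the paper's model)."""
    def __init__(self, feat_dim=16):
        super().__init__()
        # Visual branch: frame -> one feature vector (stand-in for a CNN backbone).
        self.visual = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Audio branch: mixture spectrogram -> per-bin features (stand-in for a U-Net).
        self.audio = nn.Conv2d(1, feat_dim, 3, padding=1)

    def forward(self, frame, mix_spec):
        v = self.visual(frame).flatten(1)            # (B, C) visual feature
        a = self.audio(mix_spec)                     # (B, C, F, T) audio features
        logits = torch.einsum('bc,bcft->bft', v, a)  # visual-audio inner product
        return torch.sigmoid(logits)                 # predicted spectrogram mask

def mix_and_separate_loss(model, frame1, spec1, frame2, spec2):
    # Mix the two audio tracks; the known components give free supervision.
    mix_spec = spec1 + spec2
    # Binary target mask: bins where source 1 dominates (a common simplification).
    target1 = (spec1 >= spec2).float().squeeze(1)    # (B, F, T)
    loss1 = F.binary_cross_entropy(model(frame1, mix_spec), target1)
    # Symmetric term: source 2, conditioned on the other video's frame.
    loss2 = F.binary_cross_entropy(model(frame2, mix_spec), 1.0 - target1)
    return loss1 + loss2

# Usage with random stand-in data (64x64 frames, 32-bin x 32-frame spectrograms).
model = AudioVisualSeparator()
frame1, frame2 = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
spec1, spec2 = torch.rand(2, 1, 32, 32), torch.rand(2, 1, 32, 32)
loss = mix_and_separate_loss(model, frame1, spec1, frame2, spec2)
loss.backward()

In the pixel-to-sound formulation [9], the visual feature is kept per spatial location rather than pooled, so the same inner product yields a separate mask per pixel; that per-pixel structure is what the segmentation extension in this paper builds on.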

Cite

Text

Rouditchenko et al. "Self-Supervised Segmentation and Source Separation on Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.

Markdown

[Rouditchenko et al. "Self-Supervised Segmentation and Source Separation on Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.](https://mlanthology.org/cvprw/2019/rouditchenko2019cvprw-selfsupervised/)

BibTeX

@inproceedings{rouditchenko2019cvprw-selfsupervised,
  title     = {{Self-Supervised Segmentation and Source Separation on Videos}},
  author    = {Rouditchenko, Andrew and Zhao, Hang and Gan, Chuang and McDermott, Josh H. and Torralba, Antonio},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2019},
  url       = {https://mlanthology.org/cvprw/2019/rouditchenko2019cvprw-selfsupervised/}
}