Mix and Localize: Localizing Sound Sources in Mixtures

Abstract

We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model both to group a sound mixture into individual sources and to associate each source with a visual signal. Our method solves both tasks jointly, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that our model learns. Through experiments with musical instruments and human speech, we show that our model can successfully localize multiple sounds, outperforming other self-supervised methods.
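To make the cycle-consistency idea concrete, below is a minimal PyTorch sketch of a two-step audio-visual random walk of the kind the abstract describes. The encoders, the `temperature` value, and the assumption that `aud_emb[i]` is the sound source paired with `img_emb[i]` are all illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def cycle_walk_loss(img_emb, aud_emb, temperature=0.07):
    """Contrastive random walk over an audio-visual graph (sketch).

    img_emb: (N, D) embeddings of N image nodes
    aud_emb: (N, D) embeddings of N separated sound sources,
             where aud_emb[i] is assumed to correspond to img_emb[i]
    """
    # L2-normalize so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    aud_emb = F.normalize(aud_emb, dim=-1)

    # Pairwise audio-visual similarities define the graph's edge weights.
    sim = img_emb @ aud_emb.t() / temperature          # (N, N)

    # Transition probabilities: image -> audio, then audio -> image.
    p_img_to_aud = F.softmax(sim, dim=1)               # rows sum to 1
    p_aud_to_img = F.softmax(sim.t(), dim=1)

    # Two-step return probabilities for the round-trip walk.
    p_return = p_img_to_aud @ p_aud_to_img             # (N, N)

    # Train the walker to return to its start node: the target is the
    # identity pairing, expressed as class indices 0..N-1.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.nll_loss(torch.log(p_return + 1e-8), targets)

# Example usage with random embeddings standing in for encoder outputs:
loss = cycle_walk_loss(torch.randn(8, 128), torch.randn(8, 128))
```

Supervising only the return probability is what lets the model learn both grouping and association at once: the walker can reliably return to its start image only if the learned similarity metric routes each separated sound to its true visual source.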

Cite

Text

Hu et al. "Mix and Localize: Localizing Sound Sources in Mixtures." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01023

Markdown

[Hu et al. "Mix and Localize: Localizing Sound Sources in Mixtures." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/hu2022cvpr-mix/) doi:10.1109/CVPR52688.2022.01023

BibTeX

@inproceedings{hu2022cvpr-mix,
  title     = {{Mix and Localize: Localizing Sound Sources in Mixtures}},
  author    = {Hu, Xixi and Chen, Ziyang and Owens, Andrew},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {10483--10492},
  doi       = {10.1109/CVPR52688.2022.01023},
  url       = {https://mlanthology.org/cvpr/2022/hu2022cvpr-mix/}
}