Mix and Localize: Localizing Sound Sources in Mixtures
Abstract
We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model both to group a sound mixture into its individual sources and to associate them with a visual signal. Our method solves both tasks jointly, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds each correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods.
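To make the cycle-consistency idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of a two-step contrastive random walk loss. The tensor names `audio_emb` and `visual_emb` are hypothetical embeddings from separate audio and visual encoders; the learned similarity matrix is softmax-normalized into transition probabilities in each direction, and the round-trip product is trained to match the identity.

```python
import torch
import torch.nn.functional as F

def cycle_walk_loss(audio_emb, visual_emb, temperature=0.07):
    """Sketch of a two-step contrastive random walk loss.

    audio_emb:  (N, D) embeddings of N separated sound sources
    visual_emb: (N, D) embeddings of the N corresponding images
    A walker steps audio -> image -> audio; it should return to
    its starting node, so the round-trip transition matrix is
    trained to match the identity.
    """
    # L2-normalize so dot products are cosine similarities
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)

    # Pairwise audio-visual similarities define edge weights
    sim = a @ v.t() / temperature          # (N, N)

    # Softmax over targets gives stochastic transition matrices
    p_av = sim.softmax(dim=1)              # audio -> visual
    p_va = sim.t().softmax(dim=1)          # visual -> audio

    # Two-step (round-trip) transition probabilities
    p_cycle = p_av @ p_va                  # (N, N)

    # Maximize the probability of returning to the start node
    targets = torch.arange(len(a), device=a.device)
    return F.nll_loss(torch.log(p_cycle + 1e-8), targets)
```

Under these assumptions, minimizing the loss pushes each separated sound to be most similar to its own image and vice versa, which is what yields the localization behavior described in the abstract.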
Cite
Text
Hu et al. "Mix and Localize: Localizing Sound Sources in Mixtures." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01023
Markdown
[Hu et al. "Mix and Localize: Localizing Sound Sources in Mixtures." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/hu2022cvpr-mix/) doi:10.1109/CVPR52688.2022.01023
BibTeX
@inproceedings{hu2022cvpr-mix,
title = {{Mix and Localize: Localizing Sound Sources in Mixtures}},
author = {Hu, Xixi and Chen, Ziyang and Owens, Andrew},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {10483--10492},
doi = {10.1109/CVPR52688.2022.01023},
url = {https://mlanthology.org/cvpr/2022/hu2022cvpr-mix/}
}