Sound Localization by Self-Supervised Time Delay Estimation
Abstract
Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound’s time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face.
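To make the cycle-consistency idea concrete, below is a minimal sketch (not the authors' released implementation) of a contrastive-random-walk-style loss adapted to stereo audio: per-frame embeddings of the left and right channels define soft transition probabilities between channels, and a round trip left → right → left is trained to return to its starting frame. The function and variable names (cycle_consistency_loss, feats_left, feats_right, temperature) are hypothetical.

# Minimal sketch of a cycle-consistency (contrastive random walk) loss
# over per-frame stereo embeddings; names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(feats_left, feats_right, temperature=0.07):
    # feats_left, feats_right: (batch, time, dim) embeddings of the two channels.
    # Normalize so dot products are cosine similarities.
    a = F.normalize(feats_left, dim=-1)
    b = F.normalize(feats_right, dim=-1)

    # Pairwise similarities between left and right frames: (batch, T, T).
    sim = torch.bmm(a, b.transpose(1, 2)) / temperature

    # Soft transition probabilities left -> right and right -> left.
    p_lr = sim.softmax(dim=-1)
    p_rl = sim.transpose(1, 2).softmax(dim=-1)

    # Round trip left -> right -> left; a cycle-consistent walk should
    # return to the frame it started from (identity target).
    p_cycle = torch.bmm(p_lr, p_rl)
    target = torch.arange(p_cycle.size(1), device=p_cycle.device)
    target = target.unsqueeze(0).expand(p_cycle.size(0), -1)
    return F.nll_loss(p_cycle.clamp_min(1e-8).log().flatten(0, 1),
                      target.flatten())

Under this reading, a per-frame time delay could then be read off at test time from the offset of the peak in the left-to-right transition matrix relative to its diagonal.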
Cite
Text
Chen et al. "Sound Localization by Self-Supervised Time Delay Estimation." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19809-0_28

Markdown
[Chen et al. "Sound Localization by Self-Supervised Time Delay Estimation." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/chen2022eccv-sound/) doi:10.1007/978-3-031-19809-0_28

BibTeX
@inproceedings{chen2022eccv-sound,
title = {{Sound Localization by Self-Supervised Time Delay Estimation}},
author = {Chen, Ziyang and Fouhey, David F. and Owens, Andrew},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-19809-0_28},
url = {https://mlanthology.org/eccv/2022/chen2022eccv-sound/}
}