Hear the Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Abstract

Learning to localize sound sources in videos without explicit annotations is a novel area of audio-visual research. Existing work in this area focuses on creating attention maps that capture the correlation between the two modalities in order to localize the source of a sound. In a video, the objects exhibiting movement are often the ones generating the sound. In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source. We further demonstrate that the addition of flow-based attention substantially improves visual sound source localization. Finally, we benchmark our method on standard sound source localization datasets and achieve state-of-the-art performance on the SoundNet-Flickr and VGG Sound Source datasets. Code: https://github.com/denfed/heartheflow.
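To make the abstract's core idea concrete, below is a minimal, hypothetical PyTorch sketch of using optical flow as a spatial prior over an audio-visual correspondence map. The function name, tensor shapes, and the exact fusion (elementwise modulation of a cosine-similarity map by a flow-magnitude attention map) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual method.

```python
import torch
import torch.nn.functional as F

def flow_attention_localization(vis_feat, aud_emb, flow):
    """Hypothetical sketch: modulate an audio-visual similarity map
    with an optical-flow-based spatial attention prior.

    vis_feat: (B, C, H, W)  visual feature map
    aud_emb:  (B, C)        audio embedding
    flow:     (B, 2, H, W)  dense optical flow (u, v components)
    """
    # Cosine similarity between the audio embedding and each spatial location
    vis = F.normalize(vis_feat, dim=1)
    aud = F.normalize(aud_emb, dim=1)
    av_sim = torch.einsum('bchw,bc->bhw', vis, aud)   # (B, H, W)

    # Flow magnitude as a prior: moving regions receive higher attention
    flow_mag = flow.norm(dim=1)                       # (B, H, W)
    flow_attn = torch.softmax(flow_mag.flatten(1), dim=1).view_as(flow_mag)

    # Fuse: emphasize audio-visual agreement at moving regions
    return av_sim * flow_attn                         # localization map
```

The key design point this sketch illustrates is that flow acts as a prior rather than a replacement signal: the audio-visual similarity map still drives localization, and the flow term only re-weights it toward regions exhibiting motion.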

Cite

Text

Fedorishin et al. "Hear the Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Fedorishin et al. "Hear the Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/fedorishin2023wacv-hear/)

BibTeX

@inproceedings{fedorishin2023wacv-hear,
  title     = {{Hear the Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization}},
  author    = {Fedorishin, Dennis and Mohan, Deen Dayal and Jawade, Bhavin and Setlur, Srirangaraj and Govindaraju, Venu},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {2278--2287},
  url       = {https://mlanthology.org/wacv/2023/fedorishin2023wacv-hear/}
}