Exploiting Visual Context Semantics for Sound Source Localization

Abstract

Self-supervised sound source localization in unconstrained visual scenes is an important task in audio-visual learning. In this paper, we propose a visual reasoning module that explicitly exploits rich visual context semantics, alleviating the insufficient utilization of visual information in previous works. The learning objectives are carefully designed to provide stronger supervision signals for the extracted visual semantics while enhancing audio-visual interactions, leading to more robust feature representations. Extensive experimental results demonstrate that our approach significantly boosts localization performance on various datasets, even without ImageNet-pretrained initialization. Moreover, by exploiting visual context, our framework supports both audio-visual and purely visual inference, which expands the application scope of the sound source localization task and further strengthens the competitiveness of our approach.
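As background for the task the abstract describes: self-supervised sound source localization methods commonly score each spatial position of a visual feature map by its cosine similarity to an audio embedding, yielding a localization heatmap. The sketch below illustrates that generic mechanism only; it is not the paper's proposed visual reasoning module, and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def localization_map(visual_feats, audio_embed, eps=1e-8):
    """Generic audio-visual localization heatmap (illustrative sketch,
    not the paper's module): cosine similarity between an audio
    embedding and each spatial location of a visual feature map.

    visual_feats: (C, H, W) array of spatial visual features
    audio_embed:  (C,) audio embedding vector
    returns:      (H, W) heatmap with values in [-1, 1]
    """
    # L2-normalize each spatial feature vector and the audio embedding
    v = visual_feats / (np.linalg.norm(visual_feats, axis=0, keepdims=True) + eps)
    a = audio_embed / (np.linalg.norm(audio_embed) + eps)
    # Dot product over the channel axis at every (h, w) location
    return np.einsum('chw,c->hw', v, a)
```

Regions of the heatmap with high similarity are taken as the likely sound source; training objectives then contrast matched against mismatched audio-visual pairs.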

Cite

Text

Zhou et al. "Exploiting Visual Context Semantics for Sound Source Localization." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Zhou et al. "Exploiting Visual Context Semantics for Sound Source Localization." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/zhou2023wacv-exploiting/)

BibTeX

@inproceedings{zhou2023wacv-exploiting,
  title     = {{Exploiting Visual Context Semantics for Sound Source Localization}},
  author    = {Zhou, Xinchi and Zhou, Dongzhan and Hu, Di and Zhou, Hang and Ouyang, Wanli},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {5199--5208},
  url       = {https://mlanthology.org/wacv/2023/zhou2023wacv-exploiting/}
}