Audio-Visual Grouping Network for Sound Localization from Mixtures

Abstract

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly use audio-visual association as a cue to localize sounding objects in each frame. Because multiple sound sources are mixed in the original space, few multi-source approaches localize several sources simultaneously; one exception is a recent work that uses a contrastive random walk on a graph with images and separated sounds as nodes. Despite its promising performance, that approach can only handle a fixed number of sources and cannot learn compact class-aware representations for individual sources. To address these shortcomings, we propose a novel audio-visual grouping network, AVGN, that directly learns category-wise semantic features for each source from the input audio mixture and frame in order to localize multiple sources simultaneously. Specifically, AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features. The aggregated semantic features for each source are then used as guidance to localize the corresponding visual regions. Compared with existing multi-source methods, the framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources. We conduct extensive experiments on the MUSIC, VGGSound-Instruments, and VGG-Sound Sources benchmarks. The results demonstrate that AVGN achieves state-of-the-art sounding object localization performance in both single-source and multi-source scenarios.
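The two steps named in the abstract (class tokens aggregate class-aware features, and the aggregated features guide localization) can be illustrated with a toy sketch. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: softmax attention pooling as the aggregation step, cosine similarity against frame patches as the localization step, random vectors standing in for learned tokens and encoder outputs, and all function names (`aggregate`, `localization_maps`, `demo`). It is not the authors' implementation.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(class_tokens, features):
    """Assumed aggregation: each class token attends over the features
    and pools them into one class-aware vector."""
    dim = len(class_tokens[0])
    pooled = []
    for token in class_tokens:
        weights = softmax([dot(token, f) / math.sqrt(dim) for f in features])
        pooled.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return pooled


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0


def localization_maps(class_audio, patches, height, width):
    """Assumed localization: one height x width cosine-similarity map
    per class-aware audio feature."""
    maps = []
    for vec in class_audio:
        sims = [cosine(vec, p) for p in patches]
        maps.append([sims[r * width:(r + 1) * width] for r in range(height)])
    return maps


def demo(num_classes=3, dim=8, audio_steps=5, height=4, width=4, seed=0):
    rng = random.Random(seed)

    def rand_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    class_tokens = [rand_vec() for _ in range(num_classes)]  # stand-in for learned tokens
    audio_feats = [rand_vec() for _ in range(audio_steps)]   # stand-in for mixture features
    patches = [rand_vec() for _ in range(height * width)]    # stand-in for frame patch features
    class_audio = aggregate(class_tokens, audio_feats)
    return localization_maps(class_audio, patches, height, width)
```

Calling `demo()` returns one 4 x 4 similarity map per class token. Because the number of maps follows the number of tokens rather than a fixed source count, the sketch also hints at how a token-based design could accommodate a flexible number of sources.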

Cite

Text

Mo and Tian. "Audio-Visual Grouping Network for Sound Localization from Mixtures." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01018

Markdown

[Mo and Tian. "Audio-Visual Grouping Network for Sound Localization from Mixtures." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/mo2023cvpr-audiovisual/) doi:10.1109/CVPR52729.2023.01018

BibTeX

@inproceedings{mo2023cvpr-audiovisual,
  title     = {{Audio-Visual Grouping Network for Sound Localization from Mixtures}},
  author    = {Mo, Shentong and Tian, Yapeng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {10565--10574},
  doi       = {10.1109/CVPR52729.2023.01018},
  url       = {https://mlanthology.org/cvpr/2023/mo2023cvpr-audiovisual/}
}