SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera

Abstract

Accurately localizing 3D sound sources and estimating their semantic labels, where the sources may not be visible but are assumed to lie on the physical surfaces of objects in the scene, has many real-world applications, including detecting gas leaks and machinery malfunctions. The audio-visual weak correlation in such a setting poses new challenges in deriving innovative methods that answer whether and how cross-modal information can be used to solve the task. Towards this end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array (Mic-Array). By using this rig to record audio-visual signals from multiple views, we can use the cross-modal cues to estimate the sound sources' 3D locations. Specifically, our framework, SoundLoc3D, treats the task as a set prediction problem in which each element of the set corresponds to a potential sound source. Given the audio-visual weak correlation, the set representation is initially learned from a single-view microphone array signal and then refined by actively incorporating physical surface cues revealed by multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on a large-scale simulated dataset and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.
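To make the set prediction framing concrete, the following minimal sketch shows one common way such a formulation is evaluated: a fixed number of predicted slots, each holding a 3D location and a class label, is matched one-to-one against the ground-truth sources by minimizing a total cost. The abstract does not specify the matching procedure or cost, so the brute-force matching, the cost function, the `class_weight` parameter, and the example labels are all assumptions made for illustration, not the authors' method.

```python
from itertools import permutations
import math


def match_cost(pred, target, class_weight=1.0):
    """Cost of pairing one predicted slot with one ground-truth source.

    Each item is a ((x, y, z), label) tuple. The cost is the Euclidean
    distance plus a fixed penalty when the labels differ. Both the form
    of the cost and class_weight are illustrative assumptions.
    """
    distance = math.dist(pred[0], target[0])
    penalty = 0.0 if pred[1] == target[1] else class_weight
    return distance + penalty


def match_sets(preds, targets):
    """Find the lowest-cost one-to-one assignment of slots to sources.

    Returns (assignment, total_cost), where assignment[i] is the index of
    the predicted slot paired with targets[i]. Brute force over
    permutations is used for clarity; it is only practical for small sets.
    """
    if len(preds) < len(targets):
        raise ValueError("need at least as many predicted slots as sources")
    best_assignment, best_cost = None, math.inf
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Hypothetical example: three predicted slots, two true sources.
preds = [
    ((0.0, 0.0, 0.0), "gas_leak"),
    ((1.0, 1.0, 1.0), "machinery"),
    ((5.0, 5.0, 5.0), "background"),
]
targets = [
    ((1.0, 1.0, 1.1), "machinery"),
    ((0.1, 0.0, 0.0), "gas_leak"),
]
assignment, total_cost = match_sets(preds, targets)
```

In this toy case the second slot is paired with the first source and the first slot with the second source, while the third slot stays unmatched, which is how a set representation can accommodate a variable number of sources.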

Cite

Text

He et al. "SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[He et al. "SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/he2025wacv-soundloc3d/)

BibTeX

@inproceedings{he2025wacv-soundloc3d,
  title     = {{SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera}},
  author    = {He, Yuhang and Shin, Sangyun and Cherian, Anoop and Trigoni, Niki and Markham, Andrew},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {5408--5418},
  url       = {https://mlanthology.org/wacv/2025/he2025wacv-soundloc3d/}
}