Towards Audio-Visual Navigation in Noisy Environments: A Large-Scale Benchmark Dataset and an Architecture Considering Multiple Sound-Sources
Abstract
Audio-visual navigation has received considerable attention in recent years. However, the majority of related investigations have focused on single sound-source scenarios. Studies of multiple sound-source scenarios remain underexplored due to limitations in two respects. First, the existing audio-visual navigation dataset contains only a limited number of audio samples, making it difficult to simulate diverse multiple sound-source environments. Second, existing navigation frameworks are mainly designed for single sound-source scenarios, so their performance degrades severely in multiple sound-source scenarios. In this work, we attempt to fill these two research gaps. First, we establish a large-scale BEnchmark Dataset for Audio-Visual Navigation, namely BeDAViN. This dataset consists of 2,258 audio samples with a total duration of 10.8 hours, which is more than 33 times longer than the existing audio dataset employed in the audio-visual navigation task. Second, we propose a new Embodied Navigation framework for MUltiple Sound-Sources Scenarios called ENMuS3. ENMuS3 has two essential components: the sound event descriptor and the multi-scale scene memory transformer. The former equips the agent with the ability to extract spatial and semantic features of the target sound-source among multiple sound-sources, while the latter provides the ability to track the target object effectively in noisy environments. Experimental results on our BeDAViN show that ENMuS3 strongly outperforms its counterparts, with a significant improvement in success rates across diverse scenarios.
Cite
Text
Shi et al. "Towards Audio-Visual Navigation in Noisy Environments: A Large-Scale Benchmark Dataset and an Architecture Considering Multiple Sound-Sources." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I14.33608
Markdown
[Shi et al. "Towards Audio-Visual Navigation in Noisy Environments: A Large-Scale Benchmark Dataset and an Architecture Considering Multiple Sound-Sources." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/shi2025aaai-audio/) doi:10.1609/AAAI.V39I14.33608
BibTeX
@inproceedings{shi2025aaai-audio,
title = {{Towards Audio-Visual Navigation in Noisy Environments: A Large-Scale Benchmark Dataset and an Architecture Considering Multiple Sound-Sources}},
author = {Shi, Zhanbo and Zhang, Lin and Li, Linfei and Shen, Ying},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {14673--14680},
doi = {10.1609/AAAI.V39I14.33608},
url = {https://mlanthology.org/aaai/2025/shi2025aaai-audio/}
}