DiffSED: Sound Event Detection with Denoising Diffusion

Bhosale, Swapnil; Nag, Sauradip; Kanojia, Diptesh; Deng, Jiankang; Zhu, Xiatian

doi:10.1609/AAAI.V38I2.27837

AAAI 2024 pp. 792-800

Abstract

Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions within a Transformer decoder framework. Doing so enables the model to generate accurate event boundaries even from noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, while converging more than 40% faster in training. Code: https://github.com/Surrey-UPLab/DiffSED
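
The noising process that the abstract says the model learns to reverse can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine schedule, the function names `cosine_alpha_bar` and `noise_boundaries`, the step count, and the representation of each event as a normalized (onset, offset) pair are all assumptions made for illustration. The sketch only shows the standard Gaussian forward step that turns ground-truth boundaries into noisy proposals; the Transformer decoder that denoises them is not shown.

```python
import numpy as np


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal rate at step t under a cosine schedule (assumed choice)."""
    def f(u):
        return np.cos(((u / num_steps) + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)


def noise_boundaries(x0, t, num_steps, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a) * x0, (1 - a) * I).

    x0 has shape (num_events, 2): hypothetical (onset, offset) pairs in [0, 1].
    """
    a = cosine_alpha_bar(t, num_steps)
    eps = rng.standard_normal(x0.shape)  # Gaussian noise, same shape as x0
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps


rng = np.random.default_rng(0)
# Two made-up events on a clip whose duration is normalized to 1.
x0 = np.array([[0.10, 0.35], [0.50, 0.90]])
x_early = noise_boundaries(x0, t=0, num_steps=1000, rng=rng)    # no noise yet
x_late = noise_boundaries(x0, t=1000, num_steps=1000, rng=rng)  # almost pure noise
```

At step 0 the boundaries are returned unchanged, and at the final step almost none of the original signal remains, which is the regime inference would start from under these assumptions.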

Cite

Text

Bhosale et al. "DiffSED: Sound Event Detection with Denoising Diffusion." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I2.27837

Markdown

[Bhosale et al. "DiffSED: Sound Event Detection with Denoising Diffusion." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/bhosale2024aaai-diffsed/) doi:10.1609/AAAI.V38I2.27837

BibTeX

@inproceedings{bhosale2024aaai-diffsed,
  title     = {{DiffSED: Sound Event Detection with Denoising Diffusion}},
  author    = {Bhosale, Swapnil and Nag, Sauradip and Kanojia, Diptesh and Deng, Jiankang and Zhu, Xiatian},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {792-800},
  doi       = {10.1609/AAAI.V38I2.27837},
  url       = {https://mlanthology.org/aaai/2024/bhosale2024aaai-diffsed/}
}