Joint Time–frequency Scattering-Enhanced Representation for Bird Vocalization Classification

Abstract

Neural networks (NNs) are widely used in passive acoustic monitoring. Typically, audio is converted into a Mel spectrogram as a preprocessing step before being fed into an NN. In this study, we investigate the Joint Time-Frequency Scattering (JTFS) transform as an alternative preprocessing technique for analyzing bird vocalizations. We highlight its advantages over the Mel spectrogram: it captures intricate time-frequency patterns and emphasizes rapid signal transitions. While the Mel spectrogram often assigns similar importance to all sounds, the scattering transform better distinguishes rapid variations from slow ones. We evaluate two architectures, a convolutional neural network and an attention-based Transformer. Our results demonstrate that both architectures benefit from this enhanced preprocessing, with the scattering transform providing a more discriminative representation of bird vocalizations than the traditional Mel spectrogram.
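
As a rough illustration of the two front ends compared above, the sketch below computes a log-Mel spectrogram with librosa and a wavelet scattering representation with Kymatio. The filename, window length, and filter-bank parameters (`J`, `Q`, `n_mels`) are illustrative assumptions, and `Scattering1D` is plain time scattering, used here as a simpler stand-in for the joint time-frequency variant the paper studies (whose constructor arguments vary across Kymatio releases).

```python
import librosa
from kymatio.numpy import Scattering1D

# Load a mono recording (hypothetical path) and fix its length
# so the scattering transform sees a constant input size.
y, sr = librosa.load("bird_call.wav", sr=22050, mono=True)
N = 2 ** 16                      # analysis window, ~3 s at 22.05 kHz (assumed)
y = librosa.util.fix_length(y, size=N)

# Baseline: log-Mel spectrogram, the conventional NN front end.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)

# Scattering front end: first- and second-order wavelet scattering
# over time. The second-order coefficients capture fast amplitude
# modulations that a Mel spectrogram averages away.
scattering = Scattering1D(J=8, shape=(N,), Q=12)
Sx = scattering(y)               # shape: (n_scattering_paths, n_time_frames)

print(log_mel.shape, Sx.shape)
```

Either representation can then be fed to a CNN or Transformer as a 2-D input; the comparison in the abstract concerns which of the two yields the more discriminative features.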

Cite

Text

Min et al. "Joint Time–frequency Scattering-Enhanced Representation for Bird Vocalization Classification." NeurIPS 2023 Workshops: CompSust, 2023.

Markdown

[Min et al. "Joint Time–frequency Scattering-Enhanced Representation for Bird Vocalization Classification." NeurIPS 2023 Workshops: CompSust, 2023.](https://mlanthology.org/neuripsw/2023/min2023neuripsw-joint/)

BibTeX

@inproceedings{min2023neuripsw-joint,
  title     = {{Joint Time–frequency Scattering-Enhanced Representation for Bird Vocalization Classification}},
  author    = {Min, Yimeng and Miller, Eliot T and Fink, Daniel and Gomes, Carla P},
  booktitle = {NeurIPS 2023 Workshops: CompSust},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/min2023neuripsw-joint/}
}