Can Masked Autoencoders Also Listen to Birds?

Abstract

Masked Autoencoders (MAEs) learn rich representations for audio classification through an efficient self-supervised reconstruction task. Yet general-purpose models struggle in fine-grained audio domains such as bird sound classification, which demands distinguishing subtle inter-species differences under high intra-species variability. We show that bridging this domain gap requires full-pipeline adaptation, not just domain-specific pretraining data. Using BirdSet, a large-scale bioacoustic benchmark, we systematically adapt pretraining, fine-tuning, and frozen-feature utilization. Our Bird-MAE sets new state-of-the-art results on BirdSet’s multi-label classification benchmark. Additionally, we introduce parameter-efficient prototypical probing, which boosts the utility of frozen MAE features, gaining up to 37 mAP points over linear probes and narrowing the gap to fine-tuning in low-resource settings. Bird-MAE also exhibits strong few-shot generalization with prototypical probes on our newly established few-shot benchmark on BirdSet, underscoring the importance of tailored self-supervised learning pipelines for fine-grained audio domains.
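To illustrate the idea of probing frozen features with prototypes, the sketch below uses a simplified nearest-class-mean variant on synthetic stand-in embeddings. This is only a minimal illustration, not the paper's actual method: Bird-MAE's prototypical probe is a learned, parameter-efficient module, and the toy data here merely stands in for frozen MAE encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen embeddings from a pretrained encoder:
# 3 toy "species", 10 training clips each, 8-dim features.
n_classes, n_per_class, dim = 3, 10, 8
centers = rng.normal(scale=3.0, size=(n_classes, dim))
train_x = np.vstack(
    [centers[c] + rng.normal(size=(n_per_class, dim)) for c in range(n_classes)]
)
train_y = np.repeat(np.arange(n_classes), n_per_class)

# One prototype per class: the mean of that class's frozen embeddings.
prototypes = np.stack(
    [train_x[train_y == c].mean(axis=0) for c in range(n_classes)]
)

def predict(x: np.ndarray) -> np.ndarray:
    # Nearest-prototype rule: smallest Euclidean distance wins.
    dists = np.linalg.norm(x[:, None, :] - prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Evaluate on freshly sampled clips from the same toy classes.
test_x = np.vstack(
    [centers[c] + rng.normal(size=(5, dim)) for c in range(n_classes)]
)
test_y = np.repeat(np.arange(n_classes), 5)
acc = (predict(test_x) == test_y).mean()
print(f"toy nearest-prototype accuracy: {acc:.2f}")
```

Because only the prototypes depend on the training data, the encoder stays frozen, which is what makes probing far cheaper than full fine-tuning in low-resource settings.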

Cite

Text

Rauch et al. "Can Masked Autoencoders Also Listen to Birds?" Transactions on Machine Learning Research, 2025.

Markdown

[Rauch et al. "Can Masked Autoencoders Also Listen to Birds?" Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/rauch2025tmlr-masked/)

BibTeX

@article{rauch2025tmlr-masked,
  title     = {{Can Masked Autoencoders Also Listen to Birds?}},
  author    = {Rauch, Lukas and Heinrich, René and Moummad, Ilyass and Joly, Alexis and Sick, Bernhard and Scholz, Christoph},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/rauch2025tmlr-masked/}
}