TACO: Training-Free Sound Prompted Segmentation via Semantically Constrained Audio-Visual CO-Factorization

Abstract

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches address this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are then passed to an open-vocabulary segmentation model, which produces precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
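
The co-factorization idea can be illustrated with a minimal sketch. This is not the TACO implementation: it assumes that audio and visual features are already non-negative and live in a shared embedding space of dimension d, stacks them, and applies plain NMF with multiplicative updates so that both modalities share one concept matrix H. The function name `joint_nmf`, its parameters, and the synthetic data are hypothetical, and the semantic constraint and the open-vocabulary segmentation step mentioned in the abstract are omitted.

```python
import numpy as np

def joint_nmf(audio_feats, visual_feats, n_concepts=4, n_iter=200, eps=1e-9, seed=0):
    """Hypothetical helper: factorize stacked features as [A; V] ~= [U_a; U_v] @ H.

    audio_feats:  (n_a, d) non-negative array, e.g. one row per audio frame.
    visual_feats: (n_v, d) non-negative array, e.g. one row per image patch.
    Returns per-modality activations U_a, U_v and the shared concepts H.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([audio_feats, visual_feats])   # (n_a + n_v, d)
    n, _ = X.shape
    U = rng.random((n, n_concepts)) + eps        # activations, kept non-negative
    H = rng.random((n_concepts, X.shape[1])) + eps  # shared concept basis

    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius objective ||X - U H||^2.
        # Ratios of non-negative terms keep U and H non-negative.
        H *= (U.T @ X) / (U.T @ U @ H + eps)
        U *= (X @ H.T) / (U @ H @ H.T + eps)

    n_a = audio_feats.shape[0]
    return U[:n_a], U[n_a:], H

# Synthetic stand-ins for real features (assumption: 8 audio frames,
# a 7x7 grid of image patches, 32-dimensional shared embedding).
rng = np.random.default_rng(1)
audio = rng.random((8, 32))
image = rng.random((49, 32))

U_a, U_v, H = joint_nmf(audio, image)

# Pick the concept most active in the audio and view its patch activations
# as a coarse spatial heatmap over the image grid.
concept = U_a.sum(axis=0).argmax()
heatmap = U_v[:, concept].reshape(7, 7)
```

In this toy setting the heatmap only shows where the audio-dominant concept is active among the image patches; turning such activations into a segmentation mask would require the additional components the abstract describes.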

Cite

Text

Malard et al. "TACO: Training-Free Sound Prompted Segmentation via Semantically Constrained Audio-Visual CO-Factorization." Transactions on Machine Learning Research, 2026.

Markdown

[Malard et al. "TACO: Training-Free Sound Prompted Segmentation via Semantically Constrained Audio-Visual CO-Factorization." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/malard2026tmlr-taco/)

BibTeX

@article{malard2026tmlr-taco,
  title     = {{TACO: Training-Free Sound Prompted Segmentation via Semantically Constrained Audio-Visual CO-Factorization}},
  author    = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/malard2026tmlr-taco/}
}