FLAM: Frame-Wise Language-Audio Modeling

ICML 2025 pp. 67719-67740

Abstract

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.

Cite

Text

Wu et al. "FLAM: Frame-Wise Language-Audio Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Wu et al. "FLAM: Frame-Wise Language-Audio Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wu2025icml-flam/)

BibTeX

@inproceedings{wu2025icml-flam,
  title     = {{FLAM: Frame-Wise Language-Audio Modeling}},
  author    = {Wu, Yusong and Tsirigotis, Christos and Chen, Ke and Huang, Cheng-Zhi Anna and Courville, Aaron and Nieto, Oriol and Seetharaman, Prem and Salamon, Justin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {67719--67740},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/wu2025icml-flam/}
}