FLAM: Frame-Wise Language-Audio Modeling
Abstract
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling capability needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances, during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
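To make the abstract's central idea concrete, below is a minimal, hypothetical sketch of a frame-wise audio-text objective with logit adjustment. It is not FLAM's actual implementation: the tensor shapes, the sigmoid/BCE formulation, the `scale` and `tau` parameters, and the prior-based bias term are all illustrative assumptions, shown only to clarify how a per-frame contrastive score can be calibrated against an event's base rate.

```python
# Hypothetical sketch only; NOT the authors' code. Shapes, the BCE formulation,
# and the prior-based logit adjustment are assumptions made for illustration.
import torch
import torch.nn.functional as F


def frame_wise_loss(frame_emb, text_emb, frame_labels, event_prior,
                    tau=1.0, scale=10.0):
    """Frame-wise sigmoid loss with prior-based logit adjustment.

    frame_emb:    (B, T, D) per-frame audio embeddings
    text_emb:     (B, D)    one text (event) embedding per item
    frame_labels: (B, T)    1 where the described event is active, else 0
    event_prior:  (B,)      estimated marginal activation rate of each event
    """
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity between every frame and its paired text query: (B, T)
    logits = scale * torch.einsum("btd,bd->bt", frame_emb, text_emb)

    # Logit adjustment: subtract a bias derived from the event's base rate so
    # frequent events are not trivially favored (label-imbalance correction).
    logits = logits - tau * torch.log(event_prior.clamp_min(1e-6)).unsqueeze(1)

    return F.binary_cross_entropy_with_logits(logits, frame_labels.float())


if __name__ == "__main__":
    B, T, D = 4, 50, 128
    loss = frame_wise_loss(
        torch.randn(B, T, D),          # dummy frame embeddings
        torch.randn(B, D),             # dummy text embeddings
        (torch.rand(B, T) > 0.8).float(),  # dummy frame-level labels
        torch.full((B,), 0.2),         # dummy event priors
    )
    print(loss.item())
```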
Cite
Text
Wu et al. "FLAM: Frame-Wise Language-Audio Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Wu et al. "FLAM: Frame-Wise Language-Audio Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wu2025icml-flam/)
BibTeX
@inproceedings{wu2025icml-flam,
title = {{FLAM: Frame-Wise Language-Audio Modeling}},
author = {Wu, Yusong and Tsirigotis, Christos and Chen, Ke and Huang, Cheng-Zhi Anna and Courville, Aaron and Nieto, Oriol and Seetharaman, Prem and Salamon, Justin},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {67719--67740},
volume = {267},
url = {https://mlanthology.org/icml/2025/wu2025icml-flam/}
}