TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation
Abstract
Referring audio-visual segmentation (Ref-AVS) aims to segment objects within audio-visual scenes using multimodal cues embedded in text expressions. While the Segment Anything Model (SAM) has revolutionized visual segmentation, its applicability to Ref-AVS, where multimodal cues act as novel prompts, remains unexplored. SAM's limitation to single-frame segmentation also hinders its ability to capture essential temporal context needed for multi-frame audio-visual segmentation. To address this gap, we propose TSAM, a novel extension of SAM designed to leverage multimodal cues for precise segmentation in dynamic audio-visual scenes. TSAM enhances SAM's image encoder with a temporal modeling branch, enabling spatio-temporal learning and deep multimodal fusion across video frames, while retaining SAM's pre-trained knowledge. Additionally, TSAM replaces SAM's user-interactive prompting mechanism with sparse and dense data-driven prompts, enabling more effective integration of audio-visual inputs and reference text expressions. Extensive experiments on the Ref-AVS dataset demonstrate TSAM's superiority over state-of-the-art methods. The results illustrate its effectiveness in segmenting objects in dynamic audio-visual scenes using text-based multimodal cues and its strong generalization to unseen objects.
Cite
Text
Radman and Laaksonen. "TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02230
Markdown
[Radman and Laaksonen. "TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/radman2025cvpr-tsam/) doi:10.1109/CVPR52734.2025.02230
BibTeX
@inproceedings{radman2025cvpr-tsam,
title = {{TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation}},
author = {Radman, Abduljalil and Laaksonen, Jorma},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {23947-23956},
doi = {10.1109/CVPR52734.2025.02230},
url = {https://mlanthology.org/cvpr/2025/radman2025cvpr-tsam/}
}