CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation

Abstract

Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by a transformer-based segmentation architecture, owing to its inherent ability to capture long-range dependencies and its flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy that combines class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is improved with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy. This project is supported by the Australian Research Council (ARC) through grant FT190100525.
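For readers unfamiliar with the bipartite matching step mentioned above, the following is a minimal sketch of the general idea as it commonly appears in transformer-based segmentation: each ground-truth target is assigned to a distinct query so that the total matching cost is minimal. It is not the CPM implementation. The function name `match_queries`, the cost values, and the exhaustive search (used here in place of the Hungarian algorithm, for brevity) are all assumptions made for illustration.

```python
from itertools import permutations


def match_queries(cost):
    """Assign each target to a distinct query with minimal total cost.

    cost[q][g] is a hypothetical cost of pairing query q with target g.
    Assumes there are at least as many queries as targets.
    Returns (assignment, total), where assignment[g] is the query for target g.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if cost else 0
    best_total = float("inf")
    best_assignment = ()
    # Exhaustive search: try every ordered choice of distinct queries,
    # one per target. Only practical for very small problems.
    for chosen in permutations(range(num_queries), num_targets):
        total = sum(cost[q][g] for g, q in enumerate(chosen))
        if total < best_total:
            best_total = total
            best_assignment = chosen
    return list(best_assignment), best_total


# Illustrative costs only: 3 queries (rows), 2 targets (columns).
example_cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.5],
]
assignment, total = match_queries(example_cost)
print(assignment, round(total, 3))
```

In this toy case, target 0 is matched to query 1 and target 1 to query 0, and query 2 is left unmatched. Small changes in the cost values can flip which query is matched to which target, which gives some intuition for why this step can be unstable during training.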

Cite

Text

Chen et al. "CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72684-2_25

Markdown

[Chen et al. "CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/chen2024eccv-cpm/) doi:10.1007/978-3-031-72684-2_25

BibTeX

@inproceedings{chen2024eccv-cpm,
  title     = {{CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation}},
  author    = {Chen, Yuanhong and Wang, Chong and Liu, Yuyuan and Wang, Hu and Carneiro, Gustavo},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72684-2_25},
  url       = {https://mlanthology.org/eccv/2024/chen2024eccv-cpm/}
}