Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

Abstract

Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven, pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss motivated by the inherent temporal regularities of video. Comprehensive experiments and ablation studies on the AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. The project page is available at https://yannqi.github.io/AVS-COMBO.
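The results above are reported as mIoU (mean intersection over union). As a rough illustration of what that metric measures, the sketch below averages the per-frame IoU between predicted and ground-truth binary masks. It is not the benchmark's official evaluation script: the function names `iou` and `mean_iou`, the nested-list mask representation, and the choice to score two empty masks as 1.0 are all assumptions made for this example, and the semantic (AVSS) setting would additionally average over classes.

```python
from typing import Sequence

# Hypothetical mask representation: rows of 0/1 values, one mask per frame.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks (silent frame, nothing predicted)
    # count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-frame IoU over all frames."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and equal in length")
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

For instance, a frame whose prediction covers two pixels while the ground truth covers only one of them scores an IoU of 0.5. Published results such as those above express the averaged score as a percentage.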

Cite

Text

Yang et al. "Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02562

Markdown

[Yang et al. "Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yang2024cvpr-cooperation/) doi:10.1109/CVPR52733.2024.02562

BibTeX

@inproceedings{yang2024cvpr-cooperation,
  title     = {{Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation}},
  author    = {Yang, Qi and Nie, Xing and Li, Tong and Gao, Pengfei and Guo, Ying and Zhen, Cheng and Yan, Pengfei and Xiang, Shiming},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {27134-27143},
  doi       = {10.1109/CVPR52733.2024.02562},
  url       = {https://mlanthology.org/cvpr/2024/yang2024cvpr-cooperation/}
}