Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Abstract

Audio-visual segmentation (AVS) aims to segment the sounding objects in each frame of a given video. Distinguishing sounding objects from silent ones requires both audio-visual semantic correspondence and temporal interaction. The previous method applies multi-frame cross-modal attention to perform pixel-level interactions between audio features and the visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an audio-queried transformer architecture, AQFormer, which defines a set of object queries conditioned on audio information and associates each of them with particular sounding objects. Explicit object-level semantic correspondence between the audio and visual modalities is established by gathering object information from visual features with the predefined audio queries. In addition, an Audio-Bridged Temporal Interaction (ABTI) module is proposed to exchange sounding-object-relevant information among multiple frames, using audio features as a bridge. Extensive experiments on two AVS benchmarks show that our method achieves state-of-the-art performance, including gains of 7.1% M_J and 7.6% M_F on the MS3 setting.
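To make the audio-query mechanism concrete, below is a minimal PyTorch-style sketch of how object queries conditioned on per-frame audio features might gather object-level information from visual features via cross-attention. This is not the authors' released implementation; the module name, tensor shapes, and additive conditioning scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AudioQueriedAttentionSketch(nn.Module):
    """Sketch of audio-conditioned object queries (assumed shapes and
    names; not the authors' code). Queries are conditioned on per-frame
    audio features, then attend to that frame's visual features."""

    def __init__(self, d_model: int = 256, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable object query embeddings, shared across frames (assumption).
        self.query_embed = nn.Embedding(num_queries, d_model)
        # Projection that injects the per-frame audio feature into each query.
        self.audio_proj = nn.Linear(d_model, d_model)
        # Cross-attention: audio-conditioned queries attend to visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat:  (B, T, D)     one audio embedding per frame
        # visual_feat: (B, T, HW, D) flattened per-frame visual features
        B, T, HW, D = visual_feat.shape
        # Broadcast shared queries to every frame, then add audio conditioning.
        queries = self.query_embed.weight.unsqueeze(0).unsqueeze(0).expand(B, T, -1, -1)
        queries = queries + self.audio_proj(audio_feat).unsqueeze(2)  # (B, T, N, D)
        # Fold batch and time together so each frame is attended independently.
        q = queries.reshape(B * T, -1, D)
        kv = visual_feat.reshape(B * T, HW, D)
        obj, _ = self.cross_attn(q, kv, kv)  # gather object info per frame
        return obj.reshape(B, T, -1, D)      # per-frame object embeddings


# Usage with dummy tensors: 2 clips, 5 frames, 14x14 visual feature maps.
model = AudioQueriedAttentionSketch()
audio = torch.randn(2, 5, 256)
visual = torch.randn(2, 5, 196, 256)
obj_embeds = model(audio, visual)  # -> (2, 5, 8, 256)
```

In a query-based segmenter of this kind, each resulting object embedding would typically be matched against pixel-level mask features to produce a per-object segmentation mask; the ABTI module described above would additionally exchange information among frames, with the audio features acting as the bridge.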

Cite

Text

Huang et al. "Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation." International Joint Conference on Artificial Intelligence, 2023. doi:10.24963/IJCAI.2023/97

Markdown

[Huang et al. "Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation." International Joint Conference on Artificial Intelligence, 2023.](https://mlanthology.org/ijcai/2023/huang2023ijcai-discovering/) doi:10.24963/IJCAI.2023/97

BibTeX

@inproceedings{huang2023ijcai-discovering,
  title     = {{Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation}},
  author    = {Huang, Shaofei and Li, Han and Wang, Yuqing and Zhu, Hongji and Dai, Jiao and Han, Jizhong and Rong, Wenge and Liu, Si},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {875--883},
  doi       = {10.24963/IJCAI.2023/97},
  url       = {https://mlanthology.org/ijcai/2023/huang2023ijcai-discovering/}
}