Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Abstract

Audio-visual segmentation (AVS) aims to segment the sounding objects in each frame of a given video. Distinguishing sounding objects from silent ones requires both audio-visual semantic correspondence and temporal interaction. The previous method applies multi-frame cross-modal attention to perform pixel-level interactions between audio features and the visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an audio-queried transformer architecture, AQFormer, which defines a set of object queries conditioned on audio information and associates each of them with particular sounding objects. Explicit object-level semantic correspondence between the audio and visual modalities is established by gathering object information from visual features with the predefined audio queries. In addition, an Audio-Bridged Temporal Interaction (ABTI) module is proposed to exchange sounding-object-relevant information among multiple frames, using audio features as a bridge. Extensive experiments on two AVS benchmarks show that our method achieves state-of-the-art performance, including gains of 7.1% M_J and 7.6% M_F on the MS3 setting.
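To make the audio-query mechanism concrete, below is a minimal PyTorch-style sketch of how object queries conditioned on per-frame audio features might gather object-level information from visual features via cross-attention. This is not the authors' released implementation; the module name, tensor shapes, and additive conditioning scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AudioQueriedAttentionSketch(nn.Module):
    """Sketch of audio-conditioned object queries (assumed shapes and
    names; not the authors' code). Queries are conditioned on per-frame
    audio features, then attend to that frame's visual features."""

    def __init__(self, d_model: int = 256, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable object query embeddings, shared across frames (assumption).
        self.query_embed = nn.Embedding(num_queries, d_model)
        # Projection that injects the per-frame audio feature into each query.
        self.audio_proj = nn.Linear(d_model, d_model)
        # Cross-attention: audio-conditioned queries attend to visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat:  (B, T, D)     one audio embedding per frame
        # visual_feat: (B, T, HW, D) flattened per-frame visual features
        B, T, HW, D = visual_feat.shape
        # Broadcast shared queries to every frame, then add audio conditioning.
        queries = self.query_embed.weight.unsqueeze(0).unsqueeze(0).expand(B, T, -1, -1)
        queries = queries + self.audio_proj(audio_feat).unsqueeze(2)  # (B, T, N, D)
        # Fold batch and time together so each frame is attended independently.
        q = queries.reshape(B * T, -1, D)
        kv = visual_feat.reshape(B * T, HW, D)
        obj, _ = self.cross_attn(q, kv, kv)  # gather object info per frame
        return obj.reshape(B, T, -1, D)      # per-frame object embeddings


# Usage with dummy tensors: 2 clips, 5 frames, 14x14 visual feature maps.
model = AudioQueriedAttentionSketch()
audio = torch.randn(2, 5, 256)
visual = torch.randn(2, 5, 196, 256)
obj_embeds = model(audio, visual)  # -> (2, 5, 8, 256)
```

In a query-based segmenter of this kind, each resulting object embedding would typically be matched against pixel-level mask features to produce a per-object segmentation mask; the ABTI module described above would additionally exchange information among frames, with the audio features acting as the bridge.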

Cite

Text

Huang et al. "Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation." International Joint Conference on Artificial Intelligence, 2023. doi:10.24963/IJCAI.2023/97

Markdown

[Huang et al. "Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation." International Joint Conference on Artificial Intelligence, 2023.](https://mlanthology.org/ijcai/2023/huang2023ijcai-discovering/) doi:10.24963/IJCAI.2023/97

BibTeX

@inproceedings{huang2023ijcai-discovering,
  title     = {{Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation}},
  author    = {Huang, Shaofei and Li, Han and Wang, Yuqing and Zhu, Hongji and Dai, Jiao and Han, Jizhong and Rong, Wenge and Liu, Si},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {875--883},
  doi       = {10.24963/IJCAI.2023/97},
  url       = {https://mlanthology.org/ijcai/2023/huang2023ijcai-discovering/}
}