Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks
Abstract
Audio-guided video semantic segmentation is a challenging problem in visual analysis and editing: it automatically separates foreground objects from the background in a video sequence according to referring audio expressions. However, existing referring video semantic segmentation works mainly focus on the guidance of text-based referring expressions, due to the lack of modeling of the semantic representation of audio-video interaction content. In this paper, we consider the problem of audio-guided video semantic segmentation from the viewpoint of end-to-end denoised encoder-decoder network learning. We propose a wavelet-based encoder network to learn cross-modal representations of the video content with audio-form queries. Specifically, we adopt multi-head cross-modal attention to explore the potential relations between video and query content. A 2-dimensional discrete wavelet transform is employed to decompose the audio-video features. We quantify the thresholds of the high-frequency coefficients to filter out noise and outliers. Then, a self-attention-free decoder network is developed to generate the target masks with frequency-domain transforms. Moreover, we maximize the mutual information between the encoded features and the multi-modal features after cross-modal attention to enhance the audio guidance. In addition, we construct the first large-scale audio-guided video semantic segmentation dataset. Extensive experiments show the effectiveness of our method.
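The wavelet denoising step mentioned in the abstract (decompose with a 2-dimensional discrete wavelet transform, threshold the high-frequency coefficients, transform back) can be illustrated with a minimal sketch. This is not the paper's network: the Haar wavelet, the single decomposition level, the fixed soft threshold `t`, and all function names below are assumptions chosen for illustration, and the abstract does not say how the thresholds are actually determined.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt2(x):
    """Single-level 2D Haar DWT of an array with even height and width."""
    # Filter along the width: sums and differences of adjacent column pairs.
    lo = (x[:, 0::2] + x[:, 1::2]) / SQRT2
    hi = (x[:, 0::2] - x[:, 1::2]) / SQRT2
    # Filter along the height: sums and differences of adjacent row pairs.
    ll = (lo[0::2, :] + lo[1::2, :]) / SQRT2  # low-frequency approximation
    lh = (lo[0::2, :] - lo[1::2, :]) / SQRT2  # high-frequency detail bands
    hl = (hi[0::2, :] + hi[1::2, :]) / SQRT2
    hh = (hi[0::2, :] - hi[1::2, :]) / SQRT2
    return ll, (lh, hl, hh)


def haar_idwt2(ll, highs):
    """Inverse of haar_dwt2 (exact reconstruction)."""
    lh, hl, hh = highs
    h, w = ll.shape
    lo = np.empty((2 * h, w))
    hi = np.empty((2 * h, w))
    lo[0::2, :] = (ll + lh) / SQRT2
    lo[1::2, :] = (ll - lh) / SQRT2
    hi[0::2, :] = (hl + hh) / SQRT2
    hi[1::2, :] = (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[:, 0::2] = (lo + hi) / SQRT2
    x[:, 1::2] = (lo - hi) / SQRT2
    return x


def soft_threshold(c, t):
    """Shrink coefficients toward zero; magnitudes below t become zero."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)


def wavelet_denoise(x, t):
    """Threshold only the high-frequency bands, then reconstruct.

    The fixed threshold t is an assumption made for this sketch.
    """
    ll, highs = haar_dwt2(x)
    highs = tuple(soft_threshold(c, t) for c in highs)
    return haar_idwt2(ll, highs)
```

With `t = 0` the function returns its input unchanged, since the Haar transform is exactly invertible; larger values of `t` suppress more of the small high-frequency coefficients while leaving the low-frequency band untouched.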
Cite
Text
Pan et al. "Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks." Conference on Computer Vision and Pattern Recognition, 2022.
Markdown
[Pan et al. "Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/pan2022cvpr-wnet/)
BibTeX
@inproceedings{pan2022cvpr-wnet,
title = {{Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks}},
author = {Pan, Wenwen and Shi, Haonan and Zhao, Zhou and Zhu, Jieming and He, Xiuqiang and Pan, Zhigeng and Gao, Lianli and Yu, Jun and Wu, Fei and Tian, Qi},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {1320-1331},
url = {https://mlanthology.org/cvpr/2022/pan2022cvpr-wnet/}
}