Visually Guided Sound Source Separation and Localization Using Self-Supervised Motion Representations
Abstract
In this paper, we perform audio-visual sound source separation, i.e., we separate the component audio signals from a mixture based on videos of the sound sources. Moreover, we aim to pinpoint the source location in the input video sequence. Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type (e.g. a human playing an instrument) and pre-trained motion detectors (e.g. keypoints or optical flow). However, such models are limited to a certain application domain. In this paper, we address these limitations and make the following contributions: i) we propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise in appearance and motion cues, respectively. The entire system is trained in a self-supervised manner; ii) we introduce an Audio-Motion Embedding (AME) framework to explicitly represent the motions that are related to sound; iii) we propose an audio-motion transformer architecture for audio and motion feature fusion; iv) we demonstrate state-of-the-art performance on two challenging datasets (MUSIC-21 and AVE) despite the fact that we do not use any pre-trained keypoint detectors or optical flow estimators. Project page: https://ly-zhu.github.io/self-supervised-motion-representations
Cite
Text
Zhu and Rahtu. "Visually Guided Sound Source Separation and Localization Using Self-Supervised Motion Representations." Winter Conference on Applications of Computer Vision, 2022.
Markdown
[Zhu and Rahtu. "Visually Guided Sound Source Separation and Localization Using Self-Supervised Motion Representations." Winter Conference on Applications of Computer Vision, 2022.](https://mlanthology.org/wacv/2022/zhu2022wacv-visually/)
BibTeX
@inproceedings{zhu2022wacv-visually,
title = {{Visually Guided Sound Source Separation and Localization Using Self-Supervised Motion Representations}},
author = {Zhu, Lingyu and Rahtu, Esa},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2022},
pages = {1289-1299},
url = {https://mlanthology.org/wacv/2022/zhu2022wacv-visually/}
}