AlignNet: A Unifying Approach to Audio-Visual Alignment
Abstract
We present AlignNet, a model that synchronizes videos with reference audio under non-uniform and irregular misalignments. AlignNet learns, end to end, a dense correspondence between each frame of a video and the audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and an affinity function. Together with the model, we release a dancing dataset, Dance50, for training and evaluation. Qualitative, quantitative, and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms state-of-the-art methods. Code, dataset, and sample videos are available at our project page.
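As a rough illustration of the affinity-plus-attention idea named in the abstract, the sketch below computes a dense affinity matrix between per-frame video features and per-step audio features, then soft-aligns each frame to an audio position via attention. All shapes, the cosine affinity, and the temperature are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of dense audio-visual affinity + soft alignment.
# Feature dimensions, cosine affinity, and temperature are assumed, not from the paper.
import numpy as np

def cosine_affinity(video_feats, audio_feats):
    """Affinity matrix between T_v video-frame features and T_a audio features."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    return v @ a.T  # shape (T_v, T_a), entries in [-1, 1]

def soft_align(affinity, temperature=0.1):
    """Attention over audio steps -> expected audio index for each video frame."""
    w = np.exp(affinity / temperature)
    w /= w.sum(axis=1, keepdims=True)          # row-wise softmax
    return w @ np.arange(affinity.shape[1])    # per-frame expected audio position

rng = np.random.default_rng(0)
V = rng.normal(size=(8, 16))    # 8 video frames, 16-dim features (assumed)
A = rng.normal(size=(12, 16))   # 12 audio steps, 16-dim features (assumed)
aff = cosine_affinity(V, A)
positions = soft_align(aff)     # one fractional audio index per video frame
```

In a trained model the features would come from learned encoders and the soft positions would drive a warping step; here random features simply exercise the shapes.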
Cite
Text:
Wang et al. "AlignNet: A Unifying Approach to Audio-Visual Alignment." Winter Conference on Applications of Computer Vision, 2020.
Markdown:
[Wang et al. "AlignNet: A Unifying Approach to Audio-Visual Alignment." Winter Conference on Applications of Computer Vision, 2020.](https://mlanthology.org/wacv/2020/wang2020wacv-alignnet/)
BibTeX:
@inproceedings{wang2020wacv-alignnet,
title = {{AlignNet: A Unifying Approach to Audio-Visual Alignment}},
author = {Wang, Jianren and Fang, Zhaoyuan and Zhao, Hang},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2020},
url = {https://mlanthology.org/wacv/2020/wang2020wacv-alignnet/}
}