VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer

Abstract

This paper presents an audio-visual approach for voice separation that produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer, which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and a comparison with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/
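The sketch below is a minimal, hypothetical illustration of the two-stage data flow described in the abstract (landmark graph convolution for motion, audio-visual transformer for a coarse estimate, audio-only refinement). It is not the authors' implementation; all module names, dimensions, and the simplified graph convolution and fusion are assumptions. The official code and weights are at https://ipcv.github.io/VoViT/.

```python
# Schematic sketch (assumed, not the VoViT implementation) of the two-stage pipeline:
# landmarks -> graph conv motion encoder; (audio, motion) -> audio-visual transformer
# -> coarse voice estimate; then an audio-only second stage refines the predominant voice.
import torch
import torch.nn as nn


class SimpleGraphConv(nn.Module):
    """One graph-convolution layer over face landmarks (hypothetical, simplified)."""
    def __init__(self, in_dim, out_dim, num_nodes):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # Learnable dense adjacency as a stand-in for a fixed landmark skeleton.
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))

    def forward(self, x):  # x: (batch, time, nodes, in_dim)
        x = torch.einsum("ij,btjd->btid", self.adj, x)  # aggregate neighbor features
        return torch.relu(self.linear(x))


class TwoStageAVSeparator(nn.Module):
    def __init__(self, num_landmarks=68, audio_dim=256, model_dim=256):
        super().__init__()
        # Stage 1a: lightweight motion encoder on landmark graphs.
        self.motion_encoder = SimpleGraphConv(2, model_dim, num_landmarks)
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        # Stage 1b: audio-visual transformer fusing both streams.
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.av_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.coarse_mask = nn.Linear(model_dim, audio_dim)
        # Stage 2: audio-only refinement of the predominant voice.
        self.refiner = nn.GRU(audio_dim, audio_dim, batch_first=True)

    def forward(self, mix_spec, landmarks):
        # mix_spec: (batch, time, audio_dim) magnitude spectrogram of the mixture
        # landmarks: (batch, time, num_landmarks, 2) face landmark coordinates
        motion = self.motion_encoder(landmarks).mean(dim=2)  # pool over landmark nodes
        fused = self.audio_proj(mix_spec) + motion            # simple additive fusion
        fused = self.av_transformer(fused)
        coarse = torch.sigmoid(self.coarse_mask(fused)) * mix_spec  # coarse voice estimate
        refined, _ = self.refiner(coarse)                     # audio-only enhancement stage
        return coarse, refined


if __name__ == "__main__":
    model = TwoStageAVSeparator()
    mix = torch.randn(1, 100, 256)
    lmk = torch.randn(1, 100, 68, 2)
    coarse, refined = model(mix, lmk)
    print(coarse.shape, refined.shape)  # both (1, 100, 256)
```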

Cite

Text

Montesinos et al. "VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19836-6

Markdown

[Montesinos et al. "VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/montesinos2022eccv-vovit/) doi:10.1007/978-3-031-19836-6

BibTeX

@inproceedings{montesinos2022eccv-vovit,
  title     = {{VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer}},
  author    = {Montesinos, Juan F. and Kadandale, Venkatesh S. and Haro, Gloria},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19836-6},
  url       = {https://mlanthology.org/eccv/2022/montesinos2022eccv-vovit/}
}