Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Abstract

Since facial actions such as lip movements contain significant information about speech content, it is not surprising that audio-visual speech enhancement methods are more accurate than their audio-only counterparts. Yet, state-of-the-art approaches still struggle to generate clean, realistic speech without noise artifacts and unnatural distortions in challenging acoustic environments. In this paper, we propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals. Given the importance of speaker-specific cues in speech, we focus on developing personalized models that work well for individual speakers. We demonstrate the efficacy of our approach on a new audio-visual speech dataset collected in an unconstrained, large vocabulary setting, as well as existing audio-visual datasets, outperforming speech enhancement baselines on both quantitative metrics and human evaluation studies. Please see the supplemental video for qualitative results.
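The following is a minimal, illustrative sketch of the general idea the abstract describes: predicting the discrete codes of a neural speech codec from noisy audio features together with lip-region video features, then supervising those predictions with codes extracted from clean speech. It is not the paper's actual architecture; all module names, layer sizes, and the choice of a Transformer fusion module are hypothetical assumptions for illustration, and the pretrained neural codec itself is assumed rather than defined.

import torch
import torch.nn as nn

class AudioVisualCodePredictor(nn.Module):
    """Illustrative sketch: predicts discrete codec codes from noisy audio
    features and lip-region video features. All dimensions are hypothetical."""

    def __init__(self, audio_dim=80, video_dim=512, hidden_dim=512, codebook_size=1024):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)   # e.g. log-mel frames of noisy speech
        self.video_proj = nn.Linear(video_dim, hidden_dim)   # e.g. lip embeddings from a video encoder
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.code_head = nn.Linear(hidden_dim, codebook_size)  # logits over codec codebook entries

    def forward(self, noisy_audio_feats, video_feats):
        # Both inputs: (batch, time, feat); assumed frame-aligned.
        x = self.audio_proj(noisy_audio_feats) + self.video_proj(video_feats)
        x = self.fusion(x)
        return self.code_head(x)  # (batch, time, codebook_size)

# Training sketch: the targets are codec codes extracted from the clean reference
# speech by a pretrained neural codec (assumed available, not shown here).
model = AudioVisualCodePredictor()
criterion = nn.CrossEntropyLoss()
noisy_audio = torch.randn(2, 100, 80)            # dummy batch of noisy audio features
lip_feats = torch.randn(2, 100, 512)             # dummy batch of lip features
clean_codes = torch.randint(0, 1024, (2, 100))   # dummy target codec codes
logits = model(noisy_audio, lip_feats)
loss = criterion(logits.reshape(-1, 1024), clean_codes.reshape(-1))
loss.backward()

At inference time, the predicted codes would be passed to the codec's decoder to synthesize a clean waveform, which is what allows re-synthesis rather than filtering of the noisy signal.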

Cite

Text

Yang et al. "Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00805

Markdown

[Yang et al. "Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/yang2022cvpr-audiovisual/) doi:10.1109/CVPR52688.2022.00805

BibTeX

@inproceedings{yang2022cvpr-audiovisual,
  title     = {{Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis}},
  author    = {Yang, Karren and Marković, Dejan and Krenn, Steven and Agrawal, Vasu and Richard, Alexander},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {8227--8237},
  doi       = {10.1109/CVPR52688.2022.00805},
  url       = {https://mlanthology.org/cvpr/2022/yang2022cvpr-audiovisual/}
}