LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
Abstract
Lip-to-speech involves generating natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism in which a pre-trained automatic speech recognition (ASR) model serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech: the improvement is readily perceptible while listening and is empirically reflected in a substantial reduction of the word error rate (WER). We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods. Finally, we created a demo showcasing LipVoicer’s superiority in producing natural, synchronized, and intelligible speech, providing additional evidence of its effectiveness. Project page and code: https://github.com/yochaiye/LipVoicer
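The abstract describes a classifier-guidance mechanism in which an ASR model steers the video-conditioned diffusion sampler toward the transcript predicted by the lip reader. The sketch below illustrates what one such guided denoising step could look like; all names (denoiser, asr, lipread_text, guidance_scale) are illustrative placeholders and not the actual LipVoicer API.

```python
import torch

def guided_denoising_step(denoiser, asr, mel_t, t, video_feats, lipread_text,
                          guidance_scale=1.0):
    """One reverse-diffusion step on a noisy mel-spectrogram mel_t (sketch).

    The diffusion model is conditioned on the silent video; the ASR acts as
    a classifier whose gradient pulls the sample toward the lip-read text.
    """
    # Video-conditioned noise prediction from the diffusion model
    eps = denoiser(mel_t, t, video_feats)

    # Classifier guidance: gradient of the ASR log-probability of the
    # lip-read transcript with respect to the noisy mel-spectrogram
    mel_t = mel_t.detach().requires_grad_(True)
    log_p = asr.log_prob(mel_t, lipread_text)   # e.g. a negative CTC loss
    grad = torch.autograd.grad(log_p.sum(), mel_t)[0]

    # Shift the noise estimate in the direction that increases p(text | mel)
    eps_guided = eps - guidance_scale * grad
    return eps_guided
```

In this reading, the guidance scale trades off fidelity to the video-conditioned prior against adherence to the lip-read transcript; the actual formulation and hyperparameters are specified in the paper and repository.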
Cite
Text
Yemini et al. "LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading." International Conference on Learning Representations, 2024.
Markdown
[Yemini et al. "LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/yemini2024iclr-lipvoicer/)
BibTeX
@inproceedings{yemini2024iclr-lipvoicer,
title = {{LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading}},
author = {Yemini, Yochai and Shamsian, Aviv and Bracha, Lior and Gannot, Sharon and Fetaya, Ethan},
booktitle = {International Conference on Learning Representations},
year = {2024},
url = {https://mlanthology.org/iclr/2024/yemini2024iclr-lipvoicer/}
}