Speech2Face: Learning the Face Behind a Voice

Abstract

How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how, and in what manner, our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.

Cite

Text

Oh et al. "Speech2Face: Learning the Face Behind a Voice." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.00772

Markdown

[Oh et al. "Speech2Face: Learning the Face Behind a Voice." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/oh2019cvpr-speech2face/) doi:10.1109/CVPR.2019.00772

BibTeX

@inproceedings{oh2019cvpr-speech2face,
  title     = {{Speech2Face: Learning the Face Behind a Voice}},
  author    = {Oh, Tae-Hyun and Dekel, Tali and Kim, Changil and Mosseri, Inbar and Freeman, William T. and Rubinstein, Michael and Matusik, Wojciech},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2019},
  doi       = {10.1109/CVPR.2019.00772},
  url       = {https://mlanthology.org/cvpr/2019/oh2019cvpr-speech2face/}
}