What Does Your Face Sound like? 3D Face Shape Towards Voice
Abstract
Face-based speech synthesis provides a practical way to generate voices from human faces. However, directly using 2D face images leads to problems of uninterpretability and entanglement. To address these issues, we introduce 3D face shape, which (1) has an anatomical relationship with voice characteristics, partaking in the "bone conduction" of human timbre production, and (2) is naturally independent of irrelevant factors because it excludes the blending process. We devise a three-stage framework to generate speech from 3D face shapes. To account for timbre production in both anatomical and acquired terms, our framework incorporates three additional relevant attributes: face texture, facial features, and demographics. Experiments and subjective tests demonstrate that our method generates utterances that match faces well, with good audio quality and voice diversity. We also explore and visualize how the voice changes with the face. Case studies show that our method upgrades face-voice inference to personalized, custom-made voice creation, revealing promising prospects for virtual human and dubbing applications.
Cite
Text
Yang et al. "What Does Your Face Sound like? 3D Face Shape Towards Voice." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I11.26628
Markdown
[Yang et al. "What Does Your Face Sound like? 3D Face Shape Towards Voice." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/yang2023aaai-your/) doi:10.1609/AAAI.V37I11.26628
BibTeX
@inproceedings{yang2023aaai-your,
title = {{What Does Your Face Sound like? 3D Face Shape Towards Voice}},
author = {Yang, Zhihan and Wu, Zhiyong and Shan, Ying and Jia, Jia},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2023},
pages = {13905-13913},
doi = {10.1609/AAAI.V37I11.26628},
url = {https://mlanthology.org/aaai/2023/yang2023aaai-your/}
}