One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

Abstract

We propose a neural talking-head video synthesis model and demonstrate its application to video conferencing. Our model learns to synthesize a talking-head video using a source image containing the target person's appearance and a driving video that dictates the motion in the output. The motion is encoded with a novel keypoint representation, in which identity-specific and motion-related information is decomposed in an unsupervised manner. Extensive experimental validation shows that our model outperforms competing methods on benchmark datasets. Moreover, our compact keypoint representation enables a video conferencing system that achieves the same visual quality as the commercial H.264 standard while using only one-tenth of the bandwidth. In addition, we show that our keypoint representation allows the user to rotate the head during synthesis, which is useful for simulating a face-to-face video conferencing experience.
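To make the bandwidth claim concrete, the sketch below illustrates, in schematic form only, how a keypoint-based pipeline can replace per-frame pixel transmission: the sender extracts a small set of keypoints and a head pose from each driving frame and transmits only those, while the receiver warps a single source image into the output frame. All names here (KeypointPacket, keypoint_detector, generator, NUM_KEYPOINTS) are hypothetical placeholders for the paper's learned modules, not the authors' code.

```python
# Schematic sketch of a keypoint-based video-conferencing pipeline.
# The callables passed in stand in for the paper's learned networks;
# only the data flow (send keypoints, not frames) is illustrated.

from dataclasses import dataclass
import numpy as np

NUM_KEYPOINTS = 20  # assumed value; the method uses a small fixed set of 3D keypoints


@dataclass
class KeypointPacket:
    """What the sender transmits per frame: a handful of floats instead of pixels."""
    coords: np.ndarray       # (NUM_KEYPOINTS, 3) 3D keypoint positions
    rotation: np.ndarray     # (3, 3) head-pose rotation
    translation: np.ndarray  # (3,) head-pose translation

    def payload_floats(self) -> int:
        # Rough per-frame payload: tens of floats, versus an encoded video frame.
        return self.coords.size + self.rotation.size + self.translation.size


def sender_side(driving_frame: np.ndarray, keypoint_detector) -> KeypointPacket:
    """Extract compact motion information from the current camera frame."""
    coords, rotation, translation = keypoint_detector(driving_frame)
    return KeypointPacket(coords, rotation, translation)


def receiver_side(source_image: np.ndarray,
                  source_packet: KeypointPacket,
                  driving_packet: KeypointPacket,
                  generator) -> np.ndarray:
    """Warp the one-shot source image according to the received keypoints."""
    # The generator compares source vs. driving keypoints to estimate a dense
    # motion field, then synthesizes the output frame from the source image.
    return generator(source_image, source_packet, driving_packet)
```

Under these assumptions, the per-frame payload is on the order of a few dozen floating-point values, which is the intuition behind transmitting keypoints rather than compressed frames; the actual bitrate comparison against H.264 is reported in the paper's experiments.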

Cite

Text

Wang et al. "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00991

Markdown

[Wang et al. "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/wang2021cvpr-oneshot/) doi:10.1109/CVPR46437.2021.00991

BibTeX

@inproceedings{wang2021cvpr-oneshot,
  title     = {{One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing}},
  author    = {Wang, Ting-Chun and Mallya, Arun and Liu, Ming-Yu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {10039--10049},
  doi       = {10.1109/CVPR46437.2021.00991},
  url       = {https://mlanthology.org/cvpr/2021/wang2021cvpr-oneshot/}
}