Responsive Listening Head Generation: A Benchmark Dataset and Baseline
Abstract
We present a new listening head generation benchmark for synthesizing responsive feedback of a listener (e.g., nods, smiles) during a face-to-face conversation. As the indispensable complement to talking-head generation, listening head generation has seldom been studied in the literature. Automatically synthesizing listening behavior that actively responds to a talking head is critical to applications such as digital humans, virtual agents, and social robots. In this work, we propose a novel dataset, "ViCo", highlighting listening head generation during a face-to-face conversation. A total of 92 identities (67 speakers and 76 listeners) are involved in ViCo, featuring 483 clips in a paired "speaking-listening" pattern, where listeners exhibit one of three listening styles based on their attitudes: positive, neutral, or negative. Different from traditional speech-to-gesture or talking-head generation, listening head generation takes as input both the audio and visual signals from the speaker and produces non-verbal feedback (e.g., head motions, facial expressions) in real time. Our dataset supports a wide range of applications such as human-to-human interaction, video-to-video translation, and cross-modal understanding and generation. To encourage further research, we also release a listening head generation baseline conditioned on different listening attitudes. Code & ViCo dataset: https://project.mhzhou.com/vico.
Cite
Text

Zhou et al. "Responsive Listening Head Generation: A Benchmark Dataset and Baseline." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19839-7_8

Markdown

[Zhou et al. "Responsive Listening Head Generation: A Benchmark Dataset and Baseline." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/zhou2022eccv-responsive/) doi:10.1007/978-3-031-19839-7_8

BibTeX
@inproceedings{zhou2022eccv-responsive,
title = {{Responsive Listening Head Generation: A Benchmark Dataset and Baseline}},
author = {Zhou, Mohan and Bai, Yalong and Zhang, Wei and Yao, Ting and Zhao, Tiejun and Mei, Tao},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-19839-7_8},
url = {https://mlanthology.org/eccv/2022/zhou2022eccv-responsive/}
}