Learning Individual Styles of Conversational Gesture

Abstract

Human speech is often accompanied by hand and arm gestures. We present a method for cross-modal translation from "in-the-wild" monologue speech of a single speaker to their conversational gesture motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures.

Cite

Text

Ginosar et al. "Learning Individual Styles of Conversational Gesture." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.00361

Markdown

[Ginosar et al. "Learning Individual Styles of Conversational Gesture." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/ginosar2019cvpr-learning/) doi:10.1109/CVPR.2019.00361

BibTeX

@inproceedings{ginosar2019cvpr-learning,
  title     = {{Learning Individual Styles of Conversational Gesture}},
  author    = {Ginosar, Shiry and Bar, Amir and Kohavi, Gefen and Chan, Caroline and Owens, Andrew and Malik, Jitendra},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2019},
  doi       = {10.1109/CVPR.2019.00361},
  url       = {https://mlanthology.org/cvpr/2019/ginosar2019cvpr-learning/}
}