Understanding Co-Speech Gestures In-the-Wild
Abstract
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-speech-text associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal video-gesture-speech-text representation to solve these tasks. By leveraging a combination of a global phrase contrastive loss and a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs). Further analysis reveals that the speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
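The abstract describes a global phrase contrastive loss over the shared embedding space. As a rough illustration only, the sketch below shows a CLIP-style symmetric contrastive objective between gesture-video embeddings and phrase (speech/text) embeddings; the function name `global_phrase_contrastive_loss`, the temperature value, and the tensor shapes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: symmetric InfoNCE between paired gesture and phrase embeddings.
import torch
import torch.nn.functional as F

def global_phrase_contrastive_loss(gesture_emb: torch.Tensor,
                                   phrase_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Contrast each gesture clip against all phrases in the batch (and vice versa)."""
    g = F.normalize(gesture_emb, dim=-1)          # (B, D) gesture-video embeddings
    p = F.normalize(phrase_emb, dim=-1)           # (B, D) phrase embeddings
    logits = g @ p.t() / temperature              # (B, B) scaled cosine similarities
    targets = torch.arange(g.size(0), device=g.device)  # matched pairs lie on the diagonal
    loss_g2p = F.cross_entropy(logits, targets)
    loss_p2g = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_g2p + loss_p2g)

# Toy usage with random features standing in for encoder outputs.
print(global_phrase_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```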
Cite
Text
Hegde et al. "Understanding Co-Speech Gestures In-the-Wild." International Conference on Computer Vision, 2025.
Markdown
[Hegde et al. "Understanding Co-Speech Gestures In-the-Wild." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/hegde2025iccv-understanding/)
BibTeX
@inproceedings{hegde2025iccv-understanding,
  title     = {{Understanding Co-Speech Gestures In-the-Wild}},
  author    = {Hegde, Sindhu B and Prajwal, K R and Kwon, Taein and Zisserman, Andrew},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {9977-9987},
  url       = {https://mlanthology.org/iccv/2025/hegde2025iccv-understanding/}
}