DiffTED: One-Shot Audio-Driven TED Talk Video Generation with Diffusion-Based Co-Speech Gestures

Abstract

Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks like GANs and they typically generate taking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking in diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar’s animation while ensuring temporally coherent and diverse gestures. This innovative approach utilizes classifier-free guidance, empowering the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.

Cite

Text

Hogue et al. "DiffTED: One-Shot Audio-Driven TED Talk Video Generation with Diffusion-Based Co-Speech Gestures." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00198

Markdown

[Hogue et al. "DiffTED: One-Shot Audio-Driven TED Talk Video Generation with Diffusion-Based Co-Speech Gestures." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/hogue2024cvprw-diffted/) doi:10.1109/CVPRW63382.2024.00198

BibTeX

@inproceedings{hogue2024cvprw-diffted,
  title     = {{DiffTED: One-Shot Audio-Driven TED Talk Video Generation with Diffusion-Based Co-Speech Gestures}},
  author    = {Hogue, Steven and Zhang, Chenxu and Daruger, Hamza and Tian, Yapeng and Guo, Xiaohu},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {1922-1931},
  doi       = {10.1109/CVPRW63382.2024.00198},
  url       = {https://mlanthology.org/cvprw/2024/hogue2024cvprw-diffted/}
}