DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Abstract

Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, which limits both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first uses DiTaiListener-Gen to generate short segments of listener responses conditioned on the speaker's speech and facial motions. It then refines the transitional frames with DiTaiListener-Edit so that consecutive segments join seamlessly. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) to listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) that processes the speaker's auditory and visual cues. CTM-Adapter integrates the speaker's input into the video generation process in a causal manner to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a video-to-video diffusion model for transition refinement. It fuses the short video segments produced by DiTaiListener-Gen into smooth, continuous videos, ensuring temporal consistency in facial expressions and image quality. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener: the model is the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin. See the project page for more results: https://ihp-lab.github.io/DiTaiListener/
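
The abstract does not say how CTM-Adapter enforces causality. One common way to make conditioning causal is a lower-triangular mask, under which listener frame t may only use speaker cues from frames at or before t. The toy sketch below illustrates that general idea only; it is not the authors' implementation. The function names are hypothetical, and a uniform average stands in for learned attention.

```python
def causal_mask(num_listener_frames, num_speaker_frames):
    """Return mask[t][s] == True when listener frame t may use speaker frame s.

    Assumption: listener and speaker frames share one timeline, so
    "causal" means s <= t (no access to future speaker frames).
    """
    return [[s <= t for s in range(num_speaker_frames)]
            for t in range(num_listener_frames)]


def causal_pool(speaker_features, mask):
    """Average the speaker features visible to each listener frame.

    A uniform average is a stand-in for learned attention weights.
    """
    pooled = []
    for row in mask:
        visible = [f for f, keep in zip(speaker_features, row) if keep]
        pooled.append(sum(visible) / len(visible) if visible else 0.0)
    return pooled


mask = causal_mask(4, 4)
features = [1.0, 3.0, 5.0, 7.0]  # toy per-frame speaker features
pooled = causal_pool(features, mask)
```

Here `pooled[0]` depends only on the first speaker frame, while `pooled[3]` draws on all four, so changing a later speaker frame never alters an earlier listener frame.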

Cite

Text

Siniukov et al. "DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion." International Conference on Computer Vision, 2025.

Markdown

[Siniukov et al. "DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/siniukov2025iccv-ditailistener/)

BibTeX

@inproceedings{siniukov2025iccv-ditailistener,
  title     = {{DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion}},
  author    = {Siniukov, Maksim and Chang, Di and Tran, Minh and Gong, Hongkun and Chaubey, Ashutosh and Soleymani, Mohammad},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {11991--12001},
  url       = {https://mlanthology.org/iccv/2025/siniukov2025iccv-ditailistener/}
}