SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Zhang, Youliang; Li, Zhaoyang; Wang, Duomin; Zhang, Jiahe; Zhou, Deyu; Yin, Zixin; Dai, Xili; Yu, Gang; Li, Xiu

ICLR 2026

Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These methods offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present SpeakerVid-5M, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over $8,743$ hours, SpeakerVid-5M contains more than $5.2$ million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. The dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four branches (dialogue, single, listening, and multi-turn) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data that serve as a benchmark (VidChatBench) for future work. Both the dataset and the corresponding data processing code will be publicly released.
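
The two-dimensional organization described in the abstract (four interaction branches crossed with two quality tiers) can be pictured with a minimal Python sketch. Everything here is an assumption for illustration: the names `Branch`, `Tier`, `Clip`, `select`, and the fields of `Clip` are hypothetical, the toy entries are not real dataset records, and the released metadata format may look quite different.

```python
from dataclasses import dataclass
from enum import Enum


class Branch(Enum):
    # The four interaction types named in the abstract.
    DIALOGUE = "dialogue"
    SINGLE = "single"
    LISTENING = "listening"
    MULTI_TURN = "multi-turn"


class Tier(Enum):
    # The two quality strata named in the abstract.
    PRETRAIN = "pretrain"
    SFT = "sft"


@dataclass(frozen=True)
class Clip:
    # Hypothetical record layout; not the dataset's actual schema.
    clip_id: str
    branch: Branch
    tier: Tier
    duration_s: float


def select(clips, branch, tier):
    """Return the clips that match one branch and one quality tier."""
    return [c for c in clips if c.branch is branch and c.tier is tier]


# Toy entries for illustration only.
clips = [
    Clip("a", Branch.DIALOGUE, Tier.PRETRAIN, 6.0),
    Clip("b", Branch.DIALOGUE, Tier.SFT, 5.5),
    Clip("c", Branch.LISTENING, Tier.SFT, 7.2),
]

sft_dialogue = select(clips, Branch.DIALOGUE, Tier.SFT)
print([c.clip_id for c in sft_dialogue])
```

Treating branch and tier as independent labels means a task can pick any combination, for example pre-training on all dialogue clips and then fine-tuning on the curated dialogue subset.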

Cite

Text

Zhang et al. "SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-speakervid5m/)

BibTeX

@inproceedings{zhang2026iclr-speakervid5m,
  title     = {{SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation}},
  author    = {Zhang, Youliang and Li, Zhaoyang and Wang, Duomin and Zhang, Jiahe and Zhou, Deyu and Yin, Zixin and Dai, Xili and Yu, Gang and Li, Xiu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-speakervid5m/}
}