VibeVoice: Expressive Podcast Generation with Next-Token Diffusion

Peng, Zhiliang; Yu, Jianwei; Wang, Wenhui; Chang, Yaoyao; Sun, Yutao; Dong, Li; Zhu, Yi; Xu, Weijiang; Bao, Hangbo; Wang, Zehua; Huang, Shaohan; Xia, Yan; Wei, Furu

ICLR 2026

Abstract

Generating long-form, multi-speaker conversational audio like podcasts poses significant challenges for traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. We present VibeVoice, a novel model designed to synthesize expressive, long-form speech with multiple speakers in a zero-shot manner. A core component of our approach is a continuous speech tokenizer operating at an ultra-low frame rate of 7.5 Hz. This tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. To facilitate training on authentic conversational dynamics, we developed an annotation pipeline that generates pseudo transcriptions and turn-taking labels for extensive podcast data. Leveraging this data and our efficient tokenizer, VibeVoice employs the next-token diffusion framework. This enables VibeVoice to: (1) synthesize long-form speech (up to 30 minutes) with up to 4 speakers, surpassing the typical 1-2 speaker limits of many prior models; and (2) achieve a high degree of naturalness in turn-taking, pacing, and the rendition of subtle non-lexical cues (such as breaths and lip smacks), which are crucial for listener immersion and capturing the authentic vibe of expressive conversations.
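
To make the efficiency claim concrete, the sequence length implied by the frame rate can be worked out directly. The sketch below is illustrative only: it assumes the 7.5 figure means frames per second and that each frame corresponds to one position in the model's sequence. The 50 frames-per-second comparison value is a hypothetical baseline chosen for illustration, not a number taken from the paper, and `frames_for` is a helper name introduced here.

```python
def frames_for(duration_minutes: float, frame_rate_hz: float) -> int:
    """Return how many tokenizer frames cover the given audio duration."""
    seconds = duration_minutes * 60
    return round(seconds * frame_rate_hz)


# 30 minutes is the maximum length stated in the abstract.
vibevoice_frames = frames_for(30, 7.5)

# Hypothetical higher-rate tokenizer, for comparison only (assumed value).
baseline_frames = frames_for(30, 50.0)

print(vibevoice_frames)                    # 13500
print(baseline_frames)                     # 90000
print(baseline_frames / vibevoice_frames)  # ratio of sequence lengths
```

Under these assumptions, a 30-minute recording occupies 13,500 frames, which is why a low frame rate matters for long-form generation: the sequence the model must attend over grows linearly with the frame rate.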

Cite

Text

Peng et al. "VibeVoice: Expressive Podcast Generation with Next-Token Diffusion." International Conference on Learning Representations, 2026.

Markdown

[Peng et al. "VibeVoice: Expressive Podcast Generation with Next-Token Diffusion." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/peng2026iclr-vibevoice/)

BibTeX

@inproceedings{peng2026iclr-vibevoice,
  title     = {{VibeVoice: Expressive Podcast Generation with Next-Token Diffusion}},
  author    = {Peng, Zhiliang and Yu, Jianwei and Wang, Wenhui and Chang, Yaoyao and Sun, Yutao and Dong, Li and Zhu, Yi and Xu, Weijiang and Bao, Hangbo and Wang, Zehua and Huang, Shaohan and Xia, Yan and Wei, Furu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/peng2026iclr-vibevoice/}
}