MoonCast: High-Quality Zero-Shot Podcast Generation

Abstract

Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still struggle to extend to the long, multi-speaker, spontaneous dialogues typical of real-world scenarios such as podcasts. Two challenges are primarily responsible: 1) long speech: podcasts typically span several minutes, exceeding the maximum duration that most existing systems can handle; 2) spontaneity: podcasts are spontaneous and oral in nature, in sharp contrast to formal, written text, and existing work often fails to capture this. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation that synthesizes spontaneous podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To enable long audio generation, we employ a language model with parameter, data, and context scaling to process sequences in a novel format designed for modeling entire multi-speaker, multi-turn speech interactions. To enhance spontaneity, we observe that ASR transcripts capture details of spontaneous speech (e.g., filler words indicating hesitations, and specific punctuation and spaces reflecting breathing pauses), suggesting that these transcripts can serve as a partial indicator of speech spontaneity. Building on this assumption, we use a script generation module to produce scripts that incorporate these spontaneous elements. Experiments show that MoonCast outperforms baselines, with notable improvements in contextual coherence and spontaneity.

Cite

Text

Ju et al. "MoonCast: High-Quality Zero-Shot Podcast Generation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Ju et al. "MoonCast: High-Quality Zero-Shot Podcast Generation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/ju2025neurips-mooncast/)

BibTeX

@inproceedings{ju2025neurips-mooncast,
  title     = {{MoonCast: High-Quality Zero-Shot Podcast Generation}},
  author    = {Ju, Zeqian and Yang, Dongchao and Shen, Kai and Leng, Yichong and Wang, Zhengtao and Liu, Songxiang and Zhou, Xinyu and Qin, Tao and Li, Xiangyang and Yu, Jianwei and Tan, Xu},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/ju2025neurips-mooncast/}
}