CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-Talker Conversations

Abstract

Recent advances in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation and human-like naturalness in speech remain challenging. In this paper, we introduce CoVoMix (Conversational Voice Mixture Generation), a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each stream representing the semantic information of an individual talker. These token streams are then fed into a flow-matching-based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced by a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by single-channel samples in which one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are included in the supplementary material.
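To make the middle stage of the pipeline concrete, the following is a minimal toy sketch of flow-matching-style sampling conditioned on two per-talker token streams. It is not the CoVoMix implementation: the learned vector field is replaced by the closed-form velocity of a straight-line path toward a known target, and the names `SILENCE`, `toy_target_mel`, and `euler_flow_sample`, the token ids, and the four-bin "mel" frames are all hypothetical choices made for illustration. The text-to-token model and the HiFi-GAN vocoder are omitted.

```python
import random

SILENCE = 0  # hypothetical token id meaning "this talker is silent in this frame"


def toy_target_mel(stream_a, stream_b, n_mels=4):
    """Toy stand-in for conditioning: one 'mel' frame per aligned token pair.

    A real acoustic model would learn this mapping; here it is a fixed formula.
    """
    if len(stream_a) != len(stream_b):
        raise ValueError("token streams must be time-aligned")
    return [[(a + b) / 10.0 + 0.01 * m for m in range(n_mels)]
            for a, b in zip(stream_a, stream_b)]


def euler_flow_sample(target, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps.

    Assumption: v(x, t) = (target - x) / (1 - t), the velocity of a straight
    path to a known target. In practice v is a neural network's prediction.
    """
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]  # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = [[xi + dt * (ti - xi) / (1.0 - t) for xi, ti in zip(x_row, t_row)]
             for x_row, t_row in zip(x, target)]
    return x


# Two talkers; both are active in frame 1, so that frame is a "mixed" frame.
stream_a = [5, 7, SILENCE, SILENCE]
stream_b = [SILENCE, 3, 9, 2]
target = toy_target_mel(stream_a, stream_b)
mel = euler_flow_sample(target)
```

Because the toy velocity points straight at the target, the Euler integration lands on `target` up to floating-point error. With a learned vector field the result would instead be a sample from the modeled distribution of mel-spectrograms given the token streams.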

Cite

Text

Zhang et al. "CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-Talker Conversations." Neural Information Processing Systems, 2024. doi:10.52202/079017-3183

Markdown

[Zhang et al. "CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-Talker Conversations." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhang2024neurips-covomix/) doi:10.52202/079017-3183

BibTeX

@inproceedings{zhang2024neurips-covomix,
  title     = {{CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-Talker Conversations}},
  author    = {Zhang, Leying and Qian, Yao and Zhou, Long and Liu, Shujie and Wang, Dongmei and Wang, Xiaofei and Yousefi, Midia and Qian, Yanmin and Li, Jinyu and He, Lei and Zhao, Sheng and Zeng, Michael},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3183},
  url       = {https://mlanthology.org/neurips/2024/zhang2024neurips-covomix/}
}