P-Flow: A Fast and Data-Efficient Zero-Shot TTS Through Speech Prompting
Abstract
While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate a speaker-conditional text representation. The flow matching generative decoder uses this speaker-conditional output to synthesize high-quality personalized speech significantly faster than real time. Unlike neural codec language models, P-Flow is trained on the LibriTTS dataset using a continuous mel-representation. Through our training method using continuous speech prompts, P-Flow matches the speaker similarity performance of large-scale zero-shot TTS models with two orders of magnitude less training data and samples more than 20$\times$ faster. Our results show that P-Flow has better pronunciation and is preferred over its recent state-of-the-art counterparts in human likeness and speaker similarity, making it an attractive alternative. We provide audio samples on our demo page: [https://research.nvidia.com/labs/adlr/projects/pflow](https://research.nvidia.com/labs/adlr/projects/pflow)
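To make the role of a flow matching decoder concrete, the following is a minimal NumPy sketch of generic conditional flow matching with a straight-line (optimal-transport) probability path. It is not taken from the paper: the abstract does not give the objective or the sampler, so the function names, the `sigma_min` value, the number of Euler steps, and the `vector_field(x, t, cond)` interface are all illustrative assumptions. Here `cond` stands in for the speaker-conditional encoder output and `x1` for target mel frames.

```python
import numpy as np


def cfm_training_pair(x1, rng, sigma_min=0.01):
    """Build one conditional flow matching training example (assumed formulation).

    x1 is a data sample (e.g. mel frames). Returns the sampled time t, the
    point x_t on the straight path from noise to data, and the regression
    target u for the vector field at (x_t, t).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform(0.0, 1.0)            # random time in [0, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u = x1 - (1.0 - sigma_min) * x0      # time derivative of x_t
    return t, x_t, u


def cfm_loss(vector_field, x1, cond, rng, sigma_min=0.01):
    """Mean squared error between the predicted and target vector fields."""
    t, x_t, u = cfm_training_pair(x1, rng, sigma_min)
    v = vector_field(x_t, t, cond)
    return float(np.mean((v - u) ** 2))


def euler_sample(vector_field, cond, shape, rng, n_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = rng.standard_normal(shape)       # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(x, t, cond)
    return x
```

Because sampling is a fixed, small number of solver steps rather than one step per output token, a decoder of this kind can in principle be much faster than autoregressive generation; the actual step count and solver used by P-Flow are not stated in the abstract.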
Cite
Text
Kim et al. "P-Flow: A Fast and Data-Efficient Zero-Shot TTS Through Speech Prompting." Neural Information Processing Systems, 2023.
Markdown
[Kim et al. "P-Flow: A Fast and Data-Efficient Zero-Shot TTS Through Speech Prompting." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/kim2023neurips-pflow/)
BibTeX
@inproceedings{kim2023neurips-pflow,
title = {{P-Flow: A Fast and Data-Efficient Zero-Shot TTS Through Speech Prompting}},
author = {Kim, Sungwon and Shih, Kevin and Badlani, Rohan and Santos, Joao Felipe and Bakhturina, Evelina and Desta, Mikyas and Valle, Rafael and Yoon, Sungroh and Catanzaro, Bryan},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/kim2023neurips-pflow/}
}