Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation
Abstract
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few audio samples of the given speaker, and those samples should also be short in length. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model that not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and performing episodic training. The experimental results show that our models generate high-quality speech that accurately follows the speaker's voice from a single short (1-3 sec) speech audio, significantly outperforming baselines.
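The core mechanism described here, SALN, replaces the fixed affine parameters of layer normalization with a gain and bias predicted from a style vector. Below is a minimal PyTorch sketch under that reading of the abstract; the class and parameter names (SALN, hidden_dim, style_dim) and the single-linear-layer parameterization are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SALN(nn.Module):
    """Style-Adaptive Layer Normalization (sketch).

    Normalizes hidden features as in layer norm, then applies a gain
    and bias predicted from a style vector, e.g. one extracted from a
    reference speech clip by a style encoder.
    """

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # A single linear layer predicts per-channel gain and bias
        # from the style vector (an assumed parameterization).
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim); style: (batch, style_dim)
        mean = h.mean(dim=-1, keepdim=True)
        std = h.std(dim=-1, keepdim=True, unbiased=False)
        normalized = (h - mean) / (std + 1e-8)
        gain, bias = self.affine(style).chunk(2, dim=-1)
        # Broadcast the style-conditioned gain/bias over the sequence,
        # so one reference clip modulates every text position.
        return gain.unsqueeze(1) * normalized + bias.unsqueeze(1)
```

The design choice, conditioning through the normalization statistics rather than concatenating the style vector once at the input, is presumably what lets a single 1-3 second reference clip steer synthesis without any fine-tuning.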
Cite
Text
Min et al. "Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation." International Conference on Machine Learning, 2021.
Markdown
[Min et al. "Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/min2021icml-metastylespeech/)
BibTeX
@inproceedings{min2021icml-metastylespeech,
title = {{Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation}},
author = {Min, Dongchan and Lee, Dong Bok and Yang, Eunho and Hwang, Sung Ju},
booktitle = {International Conference on Machine Learning},
year = {2021},
pages = {7748--7759},
volume = {139},
url = {https://mlanthology.org/icml/2021/min2021icml-metastylespeech/}
}