Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Abstract

With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample. Furthermore, to enhance StyleSpeech’s adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and by performing episodic training. The experimental results show that our models generate high-quality speech that accurately follows the speaker’s voice from a single short (1-3 sec) speech audio sample, significantly outperforming baselines.
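The SALN idea described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that the gain and bias are plain linear projections of a style vector and that normalization runs over the feature dimension. The function name `saln`, the weight matrices `W_gain` and `W_bias`, and all shapes are hypothetical choices made for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def saln(h, w, W_gain, W_bias, eps=1e-5):
    """Style-adaptive layer normalization (illustrative sketch only).

    h: (T, H) hidden features of the text input
    w: (S,)   style vector extracted from reference audio
    W_gain, W_bias: (S, H) hypothetical projection matrices
    """
    # Normalize each time step over its feature dimension.
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    y = (h - mu) / (sigma + eps)
    # Unlike standard layer norm, gain and bias are not fixed parameters:
    # they are predicted from the style vector (assumed linear here).
    gain = w @ W_gain  # (H,)
    bias = w @ W_bias  # (H,)
    return gain * y + bias

rng = np.random.default_rng(0)
T, H, S = 4, 8, 3  # time steps, hidden size, style size (arbitrary)
h = rng.normal(size=(T, H))
w = rng.normal(size=(S,))
W_gain = rng.normal(size=(S, H))
W_bias = rng.normal(size=(S, H))
out = saln(h, w, W_gain, W_bias)
```

Because the gain and bias depend only on `w`, swapping in a style vector from a different reference audio changes the output statistics without touching the text features themselves.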

Cite

Text

Min et al. "Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation." International Conference on Machine Learning, 2021.

Markdown

[Min et al. "Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/min2021icml-metastylespeech/)

BibTeX

@inproceedings{min2021icml-metastylespeech,
  title     = {{Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation}},
  author    = {Min, Dongchan and Lee, Dong Bok and Yang, Eunho and Hwang, Sung Ju},
  booktitle = {International Conference on Machine Learning},
  year      = {2021},
  pages     = {7748--7759},
  volume    = {139},
  url       = {https://mlanthology.org/icml/2021/min2021icml-metastylespeech/}
}