Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Abstract

In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and speaking style – independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
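To make the mechanism described in the abstract concrete, the sketch below shows one way a bank of jointly trained style tokens could be combined, via attention weights computed from a reference embedding, into a single style embedding that conditions a synthesizer such as Tacotron. This is a minimal PyTorch illustration written from the abstract alone: the module name, dimensions, single attention head, and tanh squashing are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative sketch of a global-style-token (GST) layer: a learnable bank of
# token embeddings is combined by softmax attention weights (the abstract's
# soft, interpretable "labels") into one style embedding. All names and sizes
# here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        # Bank of style tokens, trained jointly with the rest of the model.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.5)
        # Project a reference embedding (e.g. a summary of a reference clip) to a query.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim)
        query = self.query_proj(ref_embedding)                      # (batch, token_dim)
        scores = query @ self.tokens.t() / self.tokens.shape[1] ** 0.5
        weights = F.softmax(scores, dim=-1)                          # soft "labels" over tokens
        style_embedding = weights @ torch.tanh(self.tokens)          # (batch, token_dim)
        return style_embedding, weights

# Usage: the style embedding conditions the synthesizer; at inference time the
# weights can instead be set by hand to control speaking style, or computed
# from a single reference clip for style transfer.
layer = StyleTokenLayer()
ref = torch.randn(4, 128)              # stand-in for a reference-encoder output
style, weights = layer(ref)
print(style.shape, weights.shape)      # torch.Size([4, 256]) torch.Size([4, 10])
```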

Cite

Text

Wang et al. "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis." International Conference on Machine Learning, 2018.

Markdown

[Wang et al. "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis." International Conference on Machine Learning, 2018.](https://mlanthology.org/icml/2018/wang2018icml-style/)

BibTeX

@inproceedings{wang2018icml-style,
  title     = {{Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis}},
  author    = {Wang, Yuxuan and Stanton, Daisy and Zhang, Yu and Skerry-Ryan, RJ and Battenberg, Eric and Shor, Joel and Xiao, Ying and Jia, Ye and Ren, Fei and Saurous, Rif A.},
  booktitle = {International Conference on Machine Learning},
  year      = {2018},
  pages     = {5180--5189},
  volume    = {80},
  url       = {https://mlanthology.org/icml/2018/wang2018icml-style/}
}