Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis
Abstract
In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with style transfer and speech variation. Flowtron borrows insights from Autoregressive Flows and revamps Tacotron 2 in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be used to modulate many aspects of speech synthesis (timbre, expressivity, accent). Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. We provide results on speech variation, interpolation over time between samples and style transfer between seen and unseen speakers. Code and pre-trained models are publicly available at https://github.com/NVIDIA/flowtron.
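To make the training setup in the abstract concrete, the sketch below shows one affine autoregressive flow step that maps mel frames to a latent z, together with the negative log-likelihood that maximum-likelihood training minimizes. It is a minimal PyTorch illustration under assumed names and sizes (AutoregressiveAffineStep, hidden=256, no text conditioning), not the authors' implementation; see the linked repository for that.

```python
import torch
import torch.nn as nn

class AutoregressiveAffineStep(nn.Module):
    """Predict an affine transform for frame t from frames < t.
    (Text/speaker conditioning is omitted here for brevity.)"""
    def __init__(self, n_mel=80, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mel, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * n_mel)  # -> (log_s, mu) per frame

    def forward(self, x):
        # x: (batch, time, n_mel) mel-spectrogram frames.
        # Shift right so frame t only sees frames < t (autoregressive).
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.lstm(x_prev)
        log_s, mu = self.proj(h).chunk(2, dim=-1)
        z = (x - mu) * torch.exp(-log_s)  # invertible: x = z * exp(log_s) + mu
        log_det = -log_s.sum()            # log |det dz/dx| of the affine map
        return z, log_det

def flow_nll(z, log_det):
    # Negative log-likelihood under a standard Gaussian prior on z
    # (additive constants dropped).
    return 0.5 * (z ** 2).sum() - log_det

step = AutoregressiveAffineStep()
x = torch.randn(4, 120, 80)              # toy batch of mel frames
z, log_det = step(x)
loss = flow_nll(z, log_det) / x.numel()  # maximize likelihood = minimize NLL
loss.backward()
```

Because the affine map is invertible given the predicted (mu, log_s), synthesis runs the same step in reverse: sample z from the Gaussian prior (scaling its variance modulates variation) and generate frames one at a time.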
Cite
Text
Valle et al. "Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis." International Conference on Learning Representations, 2021.
Markdown
[Valle et al. "Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/valle2021iclr-flowtron/)
BibTeX
@inproceedings{valle2021iclr-flowtron,
  title = {{Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis}},
  author = {Valle, Rafael and Shih, Kevin J. and Prenger, Ryan and Catanzaro, Bryan},
  booktitle = {International Conference on Learning Representations},
  year = {2021},
  url = {https://mlanthology.org/iclr/2021/valle2021iclr-flowtron/}
}