RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis
Abstract
This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows. It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution to facilitate variable token duration during inference. We further propose a robust framework for the online extraction of speech-text alignments -- a critical yet highly unstable learning problem in end-to-end TTS frameworks. Our experiments demonstrate that our proposed techniques yield improved alignment quality and better output diversity compared to controlled baselines.
Cite
Text
Shih et al. "RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis." ICML 2021 Workshops: INNF, 2021.
Markdown
[Shih et al. "RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis." ICML 2021 Workshops: INNF, 2021.](https://mlanthology.org/icmlw/2021/shih2021icmlw-radtts/)
BibTeX
@inproceedings{shih2021icmlw-radtts,
  title = {{RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis}},
  author = {Shih, Kevin J. and Valle, Rafael and Badlani, Rohan and Lancucki, Adrian and Ping, Wei and Catanzaro, Bryan},
  booktitle = {ICML 2021 Workshops: INNF},
  year = {2021},
  url = {https://mlanthology.org/icmlw/2021/shih2021icmlw-radtts/}
}