RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis

Abstract

This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows. It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution to facilitate variable token duration during inference. We further propose a robust framework for the online extraction of speech-text alignments -- a critical yet highly unstable learning problem in end-to-end TTS frameworks. Our experiments demonstrate that our proposed techniques yield improved alignment quality and better output diversity compared to controlled baselines.
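To make the "speech rhythm as a separate generative distribution" idea concrete, here is a minimal, hypothetical sketch of sampling per-token durations at inference time by inverting a trained flow over log-durations. The affine transform with `scale` and `shift` stands in for a learned, text-conditioned normalizing flow; the names and parameters are illustrative assumptions, not the actual RAD-TTS architecture.

```python
import numpy as np

def sample_durations(rng, scale, shift, n_tokens):
    """Sample positive integer frame counts per text token.

    Illustrative stand-in for a flow-based duration prior: draw a latent
    z ~ N(0, I), apply an inverse affine "flow" x = z * scale + shift in
    log-duration space, then exponentiate and round to frame counts.
    """
    z = rng.standard_normal(n_tokens)
    log_dur = z * scale + shift
    # Durations are numbers of mel frames, so they must be positive integers.
    return np.maximum(1, np.round(np.exp(log_dur))).astype(int)

rng = np.random.default_rng(0)
# Two draws for the same text yield different rhythms -- the source of
# the diverse synthesis the abstract refers to.
durations_a = sample_durations(rng, scale=0.3, shift=np.log(6.0), n_tokens=8)
durations_b = sample_durations(rng, scale=0.3, shift=np.log(6.0), n_tokens=8)
```

Because durations are drawn from a distribution rather than predicted deterministically, resampling the latent produces varied but plausible pacing for the same input text.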

Cite

Text

Shih et al. "RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis." ICML 2021 Workshops: INNF, 2021.

Markdown

[Shih et al. "RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis." ICML 2021 Workshops: INNF, 2021.](https://mlanthology.org/icmlw/2021/shih2021icmlw-radtts/)

BibTeX

@inproceedings{shih2021icmlw-radtts,
  title     = {{RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis}},
  author    = {Shih, Kevin J. and Valle, Rafael and Badlani, Rohan and Lancucki, Adrian and Ping, Wei and Catanzaro, Bryan},
  booktitle = {ICML 2021 Workshops: INNF},
  year      = {2021},
  url       = {https://mlanthology.org/icmlw/2021/shih2021icmlw-radtts/}
}