Parallel Neural Text-to-Speech

Abstract

In this work, we first propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrograms. It is fully convolutional and obtains a 46.7× speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality with a WaveNet vocoder. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively refining the attention in a layer-by-layer manner. Based on ParaNet, we build the first fully parallel neural text-to-speech system using parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We investigate several parallel vocoders within the TTS system, including variants of IAF vocoders and a bipartite-flow vocoder.
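The layer-by-layer attention refinement mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation; all shapes, names, and the residual update are illustrative assumptions. The idea shown is that every decoder layer re-attends to the text encoding, so the text-to-speech alignment is computed afresh (and can improve) at each layer instead of being fixed once:

```python
import numpy as np

# Hypothetical toy sketch (not the paper's code): layer-by-layer attention
# refinement in a non-autoregressive decoder. All spectrogram-frame queries
# are processed in parallel; each layer re-attends to the text encoding.

rng = np.random.default_rng(0)

T_text, T_spec, d = 8, 20, 16           # text length, #frames, model dim (assumed)
keys = rng.normal(size=(T_text, d))     # stand-in for text-encoder keys
values = rng.normal(size=(T_text, d))   # stand-in for text-encoder values
h = rng.normal(size=(T_spec, d))        # positional queries for all frames at once

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

num_layers = 3
for layer in range(num_layers):
    scores = h @ keys.T / np.sqrt(d)    # (T_spec, T_text) attention scores
    attn = softmax(scores, axis=-1)     # alignment between frames and text
    context = attn @ values             # attended text context per frame
    h = h + context                     # residual update feeds the next layer

print(attn.shape)                       # one alignment row per spectrogram frame
```

Because no frame's query depends on a previously generated frame, the whole loop is a single feed-forward pass over all frames, which is the source of ParaNet's synthesis speed-up.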

Cite

Text

Peng et al. "Parallel Neural Text-to-Speech." International Conference on Learning Representations, 2020.

Markdown

[Peng et al. "Parallel Neural Text-to-Speech." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/peng2020iclr-parallel/)

BibTeX

@inproceedings{peng2020iclr-parallel,
  title     = {{Parallel Neural Text-to-Speech}},
  author    = {Peng, Kainan and Ping, Wei and Song, Zhao and Zhao, Kexin},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/peng2020iclr-parallel/}
}