Parallel Neural Text-to-Speech

Abstract

In this work, we first propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrograms. It is fully convolutional and obtains a 46.7× speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality with a WaveNet vocoder. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively refining the attention in a layer-by-layer manner. Based on ParaNet, we build the first fully parallel neural text-to-speech system using parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We investigate several parallel vocoders within the TTS system, including variants of IAF vocoders and a bipartite-flow vocoder.
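The layer-by-layer attention refinement mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation; all shapes, names, and the residual update are illustrative assumptions. The idea shown is that every decoder layer re-attends to the text encoding, so the text-to-speech alignment is computed afresh (and can improve) at each layer instead of being fixed once:

```python
import numpy as np

# Hypothetical toy sketch (not the paper's code): layer-by-layer attention
# refinement in a non-autoregressive decoder. All spectrogram-frame queries
# are processed in parallel; each layer re-attends to the text encoding.

rng = np.random.default_rng(0)

T_text, T_spec, d = 8, 20, 16           # text length, #frames, model dim (assumed)
keys = rng.normal(size=(T_text, d))     # stand-in for text-encoder keys
values = rng.normal(size=(T_text, d))   # stand-in for text-encoder values
h = rng.normal(size=(T_spec, d))        # positional queries for all frames at once

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

num_layers = 3
for layer in range(num_layers):
    scores = h @ keys.T / np.sqrt(d)    # (T_spec, T_text) attention scores
    attn = softmax(scores, axis=-1)     # alignment between frames and text
    context = attn @ values             # attended text context per frame
    h = h + context                     # residual update feeds the next layer

print(attn.shape)                       # one alignment row per spectrogram frame
```

Because no frame's query depends on a previously generated frame, the whole loop is a single feed-forward pass over all frames, which is the source of ParaNet's synthesis speed-up.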

Cite

Text

Peng et al. "Parallel Neural Text-to-Speech." International Conference on Learning Representations, 2020.

Markdown

[Peng et al. "Parallel Neural Text-to-Speech." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/peng2020iclr-parallel/)

BibTeX

@inproceedings{peng2020iclr-parallel,
  title     = {{Parallel Neural Text-to-Speech}},
  author    = {Peng, Kainan and Ping, Wei and Song, Zhao and Zhao, Kexin},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/peng2020iclr-parallel/}
}