Non-Autoregressive Neural Text-to-Speech

Abstract

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and achieves a 46.7-times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively refining the attention in a layer-by-layer manner. Furthermore, we build a fully parallel text-to-speech system by applying various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet, as in previous work.
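The core architectural idea in the abstract, predicting all spectrogram frames in a single feed-forward pass while refining the text-to-speech attention layer by layer, can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration under simplifying assumptions, not the authors' implementation: the class names, dimensions, and the use of positional encodings as the decoder input are hypothetical stand-ins.

```python
# A minimal sketch of non-autoregressive decoding with layer-by-layer
# attention refinement, in the spirit of ParaNet. All names here are
# illustrative, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefiningAttentionLayer(nn.Module):
    """One decoder layer: a convolution over the current hidden states,
    then dot-product attention over the encoded text."""
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.query_proj = nn.Linear(d_model, d_model)

    def forward(self, dec, enc_keys, enc_values):
        # dec: (batch, T_dec, d_model); enc_*: (batch, T_enc, d_model)
        h = F.relu(self.conv(dec.transpose(1, 2)).transpose(1, 2))
        q = self.query_proj(h)
        scores = torch.bmm(q, enc_keys.transpose(1, 2)) / enc_keys.size(-1) ** 0.5
        align = F.softmax(scores, dim=-1)   # alignment refined at each layer
        ctx = torch.bmm(align, enc_values)
        return h + ctx, align

class NonAutoregressiveDecoder(nn.Module):
    """Predicts all spectrogram frames in one feed-forward pass;
    each layer re-attends to the text, sharpening the alignment."""
    def __init__(self, d_model=256, n_layers=4, n_mels=80):
        super().__init__()
        self.layers = nn.ModuleList(
            RefiningAttentionLayer(d_model) for _ in range(n_layers))
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, pos_embed, enc_keys, enc_values):
        # pos_embed: positional encodings standing in for the (absent)
        # autoregressive input; shape (batch, T_dec, d_model).
        h = pos_embed
        alignments = []
        for layer in self.layers:
            h, align = layer(h, enc_keys, enc_values)
            alignments.append(align)
        return self.out(h), alignments
```

Because no output frame depends on a previously generated frame, the whole spectrogram is produced in parallel, which is the source of the synthesis speed-up; the returned alignments list exposes each layer's attention map, mirroring the paper's layer-by-layer alignment refinement.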

Cite

Text

Peng et al. "Non-Autoregressive Neural Text-to-Speech." International Conference on Machine Learning, 2020.

Markdown

[Peng et al. "Non-Autoregressive Neural Text-to-Speech." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/peng2020icml-nonautoregressive/)

BibTeX

@inproceedings{peng2020icml-nonautoregressive,
  title     = {{Non-Autoregressive Neural Text-to-Speech}},
  author    = {Peng, Kainan and Ping, Wei and Song, Zhao and Zhao, Kexin},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {7586--7598},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/peng2020icml-nonautoregressive/}
}