Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Ping, Wei; Peng, Kainan; Gibiansky, Andrew; Arik, Sercan O.; Kannan, Ajay; Narang, Sharan; Raiman, Jonathan; Miller, John

ICLR 2018

Abstract

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training an order of magnitude faster. We scale Deep Voice 3 to dataset sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on a single GPU server.

Cite

Text

Ping et al. "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning." International Conference on Learning Representations, 2018.

Markdown

[Ping et al. "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning." International Conference on Learning Representations, 2018.](https://mlanthology.org/iclr/2018/ping2018iclr-deep/)

BibTeX

@inproceedings{ping2018iclr-deep,
  title     = {{Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning}},
  author    = {Ping, Wei and Peng, Kainan and Gibiansky, Andrew and Arik, Sercan O. and Kannan, Ajay and Narang, Sharan and Raiman, Jonathan and Miller, John},
  booktitle = {International Conference on Learning Representations},
  year      = {2018},
  url       = {https://mlanthology.org/iclr/2018/ping2018iclr-deep/}
}