DiffWave: A Versatile Diffusion Model for Audio Synthesis

Abstract

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts white noise into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43) while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models on the challenging unconditional generation task in terms of audio quality and sample diversity, according to various automatic and human evaluations.
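To make the abstract's sampling claim concrete, below is a minimal sketch of a DDPM-style reverse process: starting from white noise, a fixed number of denoising steps yields a waveform. This is an illustration of the general technique, not the paper's implementation; the step count, the linear noise schedule, and the `eps_model` placeholder are all assumptions.

```python
import torch

# Assumed linear noise schedule; DiffWave's actual schedule may differ.
T = 50
betas = torch.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, length, conditioner=None):
    """DDPM-style reverse process: map white noise to a waveform in T steps.

    eps_model(x_t, t, conditioner) is a hypothetical stand-in for the DiffWave
    network; it is assumed to predict the noise component of x_t at step t.
    """
    x = torch.randn(1, length)  # start from white noise
    for t in reversed(range(T)):
        eps = eps_model(x, torch.tensor([t]), conditioner)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            sigma = torch.sqrt(betas[t])  # one common variance choice
            x = mean + sigma * torch.randn_like(x)
        else:
            x = mean  # final step is deterministic
    return x

# Usage with a dummy noise predictor, just to exercise the loop:
waveform = sample(lambda x, t, c: torch.zeros_like(x), length=16000)
```

Because the number of steps `T` is fixed and every step denoises the full waveform in parallel, synthesis cost does not grow per sample the way it does for an autoregressive vocoder, which is the source of the speedup the abstract reports.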

Cite

Text

Kong et al. "DiffWave: A Versatile Diffusion Model for Audio Synthesis." International Conference on Learning Representations, 2021.

Markdown

[Kong et al. "DiffWave: A Versatile Diffusion Model for Audio Synthesis." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/kong2021iclr-diffwave/)

BibTeX

@inproceedings{kong2021iclr-diffwave,
  title     = {{DiffWave: A Versatile Diffusion Model for Audio Synthesis}},
  author    = {Kong, Zhifeng and Ping, Wei and Huang, Jiaji and Zhao, Kexin and Catanzaro, Bryan},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/kong2021iclr-diffwave/}
}