WaveFlow: A Compact Flow-Based Model for Raw Audio

Abstract

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveforms with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates speech with fidelity comparable to WaveNet, while synthesizing several orders of magnitude faster, as it requires only a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15× smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
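The abstract's key idea of turning a long 1-D waveform into a 2-D array, so that autoregression only needs to run over the short height dimension, can be sketched as below. This is a minimal illustration under the assumption that adjacent samples are stacked within the same column; the function name `squeeze_waveform` and the variable names are hypothetical, not from the paper's code.

```python
import numpy as np

def squeeze_waveform(x: np.ndarray, h: int) -> np.ndarray:
    """Reshape a 1-D waveform of length n (n divisible by h) into an
    h x (n // h) matrix, with adjacent samples in the same column.

    A flow that is autoregressive only over the h rows then needs h
    sequential steps to invert, instead of n for a fully autoregressive
    model such as WaveNet.
    """
    n = x.shape[0]
    assert n % h == 0, "waveform length must be divisible by the squeeze height h"
    # Row i of the result holds samples i, i + h, i + 2h, ... so each
    # column contains h consecutive time-steps of the original signal.
    return x.reshape(n // h, h).T

# Example: a 16-sample waveform squeezed with height h = 8 yields an
# 8 x 2 matrix; inversion takes 8 sequential steps rather than 16.
x = np.arange(16, dtype=np.float32)
X = squeeze_waveform(x, h=8)
print(X.shape)  # (8, 2)
```

With `h = 1` the model degenerates to a non-autoregressive flow over the full sequence, while letting `h` equal the waveform length recovers a fully autoregressive factorization; this is the sense in which the paper unifies the two model families.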

Cite

Text

Ping et al. "WaveFlow: A Compact Flow-Based Model for Raw Audio." International Conference on Machine Learning, 2020.

Markdown

[Ping et al. "WaveFlow: A Compact Flow-Based Model for Raw Audio." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/ping2020icml-waveflow/)

BibTeX

@inproceedings{ping2020icml-waveflow,
  title     = {{WaveFlow: A Compact Flow-Based Model for Raw Audio}},
  author    = {Ping, Wei and Peng, Kainan and Zhao, Kexin and Song, Zhao},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {7706--7716},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/ping2020icml-waveflow/}
}