Hierarchical Generative Modeling for Controllable Speech Synthesis
Abstract
This paper proposes a neural end-to-end text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker.
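As an illustration only, the two-level latent structure described above can be sketched as ancestral sampling from a Gaussian mixture prior: first draw a categorical component, then draw a Gaussian vector conditioned on it. The function name `sample_latent`, the component count, the dimensionality, and all parameter values are hypothetical placeholders, not the paper's configuration. The sketch assumes diagonal covariances and omits the TTS model that would consume the latent.

```python
import random

def sample_latent(weights, means, stds, rng):
    """Draw (y, z) with y ~ Categorical(weights) and
    z ~ N(means[y], diag(stds[y]^2)). Diagonal covariance is an assumption."""
    # First level: pick a mixture component (an attribute group, e.g. clean/noisy).
    y = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Second level: sample a continuous attribute vector conditioned on y.
    z = [rng.gauss(m, s) for m, s in zip(means[y], stds[y])]
    return y, z

# Hypothetical prior: 2 components, 2 latent dimensions.
weights = [0.5, 0.5]
means = [[-2.0, 0.0], [2.0, 0.0]]
stds = [[0.5, 0.5], [0.5, 0.5]]

rng = random.Random(0)
y, z = sample_latent(weights, means, stds, rng)

# Coarse control: fix the component and use its mean as the latent.
z_fixed = list(means[1])
# Fine-grained control: shift a single dimension away from that mean.
z_shifted = list(z_fixed)
z_shifted[1] += 1.0
```

In this toy setup, fixing `y` selects an attribute group, while moving along one coordinate of `z` stands in for adjusting a single attribute within that group.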
Cite
Text
Hsu et al. "Hierarchical Generative Modeling for Controllable Speech Synthesis." International Conference on Learning Representations, 2019.
Markdown
[Hsu et al. "Hierarchical Generative Modeling for Controllable Speech Synthesis." International Conference on Learning Representations, 2019.](https://mlanthology.org/iclr/2019/hsu2019iclr-hierarchical/)
BibTeX
@inproceedings{hsu2019iclr-hierarchical,
title = {{Hierarchical Generative Modeling for Controllable Speech Synthesis}},
author = {Hsu, Wei-Ning and Zhang, Yu and Weiss, Ron J. and Zen, Heiga and Wu, Yonghui and Wang, Yuxuan and Cao, Yuan and Jia, Ye and Chen, Zhifeng and Shen, Jonathan and Nguyen, Patrick and Pang, Ruoming},
booktitle = {International Conference on Learning Representations},
year = {2019},
url = {https://mlanthology.org/iclr/2019/hsu2019iclr-hierarchical/}
}