PITS: Variational Pitch Inference Without Fundamental Frequency for End-to-End Pitch-Controllable TTS

Abstract

Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code, audio samples, and demo are available at https://github.com/anonymous-pits/pits.
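
The abstract names the Yingram as the pitch representation but does not define it. As a rough illustration only, the sketch below assumes the usual YIN-based construction: compute the cumulative mean normalized difference (CMND) function of a frame, then resample it onto a MIDI-note axis by linear interpolation over lag. The function names (`cmnd`, `yingram_frame`), the sample rate, frame length, and MIDI range are all hypothetical choices for this example and are not taken from the PITS code.

```python
import numpy as np

def cmnd(frame, max_lag):
    """Cumulative mean normalized difference function (from the YIN algorithm)."""
    w = len(frame) - max_lag          # integration window: every lag sees w samples
    d = np.zeros(max_lag)
    for tau in range(1, max_lag):
        diff = frame[:w] - frame[tau:tau + w]
        d[tau] = np.sum(diff ** 2)    # squared difference at lag tau
    out = np.ones(max_lag)            # by convention the value at lag 0 is 1
    running = np.cumsum(d[1:])        # running[k] = d(1) + ... + d(k + 1)
    out[1:] = d[1:] * np.arange(1, max_lag) / np.maximum(running, 1e-8)
    return out

def yingram_frame(frame, sr, midi_min, midi_max, max_lag):
    """One Yingram column: CMND sampled at the lag of each MIDI note (assumed definition)."""
    c = cmnd(frame, max_lag)
    midi = np.arange(midi_min, midi_max + 1)
    freqs = 440.0 * 2.0 ** ((midi - 69) / 12.0)   # MIDI note number to Hz
    lags = sr / freqs                              # fractional lag in samples
    return np.interp(lags, np.arange(max_lag), c)  # linear interpolation over lag

# Illustrative usage: a 220 Hz sine (MIDI note 57) sampled at 16 kHz.
sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 220.0 * t)
y = yingram_frame(frame, sr, midi_min=50, midi_max=80, max_lag=200)
print(50 + int(np.argmin(y)))  # MIDI bin with the lowest CMND value
```

Under this assumed definition, a periodic frame produces a dip at the MIDI bin matching its period, so pitch information is kept as a dense feature rather than as a single fundamental-frequency value.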

Cite

Text

Lee et al. "PITS: Variational Pitch Inference Without Fundamental Frequency for End-to-End Pitch-Controllable TTS." ICML 2023 Workshops: SPIGM, 2023.

Markdown

[Lee et al. "PITS: Variational Pitch Inference Without Fundamental Frequency for End-to-End Pitch-Controllable TTS." ICML 2023 Workshops: SPIGM, 2023.](https://mlanthology.org/icmlw/2023/lee2023icmlw-pits/)

BibTeX

@inproceedings{lee2023icmlw-pits,
  title     = {{PITS: Variational Pitch Inference Without Fundamental Frequency for End-to-End Pitch-Controllable TTS}},
  author    = {Lee, Junhyeok and Jung, Wonbin and Cho, Hyunjae and Kim, Jaeyeon and Kim, Jaehwan},
  booktitle = {ICML 2023 Workshops: SPIGM},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/lee2023icmlw-pits/}
}