E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer

Abstract

Because previous emotional TTS models rely on a two-stage pipeline or on additional labels, their training processes are complex and incur high labeling costs. To address this problem, this paper presents E3-VITS, an end-to-end emotional TTS model that overcomes these limitations. E3-VITS synthesizes high-quality speech under multi-speaker conditions, supports emotional speech synthesis conditioned on either reference speech or a textual description, and enables cross-speaker emotion transfer with a disjoint dataset. To implement E3-VITS, we propose batch-permuted style perturbation, which generates audio samples with unpaired emotions to improve the quality of cross-speaker emotion transfer. Results show that E3-VITS outperforms the baseline model in terms of naturalness, speaker and emotion similarity, and inference speed.
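
The abstract does not spell out how batch-permuted style perturbation is implemented, but the core idea it names is reassigning style (emotion) embeddings across a batch so the generator sees speaker-emotion combinations that never co-occur in the training data. Below is a minimal, hypothetical sketch of that permutation step; all names (`batch_permute_styles`, `style_emb`) are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of batch-permuted style perturbation.
# Assumption: per-utterance style embeddings are permuted within a batch
# to create unpaired speaker-emotion combinations; the paper's exact
# implementation may differ.
import torch


def batch_permute_styles(style_emb: torch.Tensor) -> torch.Tensor:
    """Shuffle style (emotion) embeddings within a batch.

    style_emb: tensor of shape (B, D), one embedding per utterance,
    extracted from reference speech or a textual description.
    Returns the same embeddings reassigned to other batch items,
    yielding speaker-emotion pairs absent from the training data.
    """
    perm = torch.randperm(style_emb.size(0), device=style_emb.device)
    return style_emb[perm]


# Illustrative use inside a training step (encoders and generator are
# placeholders): the permuted styles condition the generator alongside
# the original speaker embeddings.
#
#   unpaired_style = batch_permute_styles(style_emb)
#   y_unpaired = generator(text_hidden, speaker_emb, unpaired_style)
#
# Since no ground-truth audio exists for these unpaired combinations,
# such samples would be supervised with adversarial or consistency
# losses rather than a reconstruction loss.
```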

Cite

Text

Jung and Lee. "E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer." ICML 2023 Workshops: DeployableGenerativeAI, 2023.

Markdown

[Jung and Lee. "E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer." ICML 2023 Workshops: DeployableGenerativeAI, 2023.](https://mlanthology.org/icmlw/2023/jung2023icmlw-e3vits/)

BibTeX

@inproceedings{jung2023icmlw-e3vits,
  title     = {{E3-VITS: Emotional End-to-End TTS with Cross-Speaker Style Transfer}},
  author    = {Jung, Wonbin and Lee, Junhyeok},
  booktitle = {ICML 2023 Workshops: DeployableGenerativeAI},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/jung2023icmlw-e3vits/}
}