PolyVoice: Language Models for Speech to Speech Translation

Abstract

With the huge success of GPT models in natural language processing, there is a growing interest in applying language modeling approaches to speech tasks. Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese→English and English→Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality. Speech samples are available at https://polyvoice.github.io.
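The abstract describes a cascade of three decoder-only language models connected through unsupervised semantic units. The sketch below illustrates one plausible reading of that pipeline; it is not the authors' implementation, and every class, function, and prompt format shown (extract_semantic_units, TranslationLM, DurationLM, SpeechSynthesisLM, speaker_prompt) is an illustrative assumption with dummy logic standing in for trained models.

```python
# Hypothetical sketch of the three-LM cascade described in the abstract.
# All names and behaviors are assumptions; real systems would use trained
# decoder-only LMs and a self-supervised unit extractor.

from typing import List


def extract_semantic_units(speech: List[float]) -> List[int]:
    """Stand-in for an unsupervised semantic-unit extractor
    (e.g. clustered self-supervised features); here samples are
    simply bucketed into integer unit ids."""
    return [int(abs(x) * 10) % 100 for x in speech]


class TranslationLM:
    """Decoder-only LM: source-language units (as prompt) -> target-language units."""
    def generate(self, source_units: List[int]) -> List[int]:
        return [(u + 1) % 100 for u in source_units]  # dummy unit-to-unit mapping


class DurationLM:
    """Decoder-only LM: target units (as prompt) -> per-unit durations in frames."""
    def generate(self, target_units: List[int]) -> List[int]:
        return [2 for _ in target_units]  # dummy constant durations


class SpeechSynthesisLM:
    """Decoder-only LM: duration-expanded units plus a speaker prompt ->
    acoustic tokens that a codec decoder would turn into a waveform."""
    def generate(self, units: List[int], durations: List[int],
                 speaker_prompt: List[int]) -> List[int]:
        expanded = [u for u, d in zip(units, durations) for _ in range(d)]
        return speaker_prompt + expanded  # dummy acoustic-token sequence


def translate_speech(source_speech: List[float],
                     speaker_prompt: List[int]) -> List[int]:
    """Chain the three LMs: semantic units -> translated units ->
    durations -> acoustic tokens conditioned on the speaker prompt."""
    source_units = extract_semantic_units(source_speech)
    target_units = TranslationLM().generate(source_units)
    durations = DurationLM().generate(target_units)
    return SpeechSynthesisLM().generate(target_units, durations, speaker_prompt)


if __name__ == "__main__":
    tokens = translate_speech([0.1, -0.3, 0.7], speaker_prompt=[5, 6])
    print(tokens)
```

Because every stage works on discrete units rather than text, the same cascade applies to unwritten source or target languages, and the speaker prompt is what allows the synthesis stage to clone the source speaker's voice.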

Cite

Text

Dong et al. "PolyVoice: Language Models for Speech to Speech Translation." International Conference on Learning Representations, 2024.

Markdown

[Dong et al. "PolyVoice: Language Models for Speech to Speech Translation." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/dong2024iclr-polyvoice/)

BibTeX

@inproceedings{dong2024iclr-polyvoice,
  title     = {{PolyVoice: Language Models for Speech to Speech Translation}},
  author    = {Dong, Qianqian and Huang, Zhiying and Tian, Qiao and Xu, Chen and Ko, Tom and Zhao, Yunlong and Feng, Siyuan and Li, Tang and Wang, Kexin and Cheng, Xuxin and Yue, Fengpeng and Bai, Ye and Chen, Xi and Lu, Lu and Ma, Zejun and Wang, Yuping and Wang, Mingxuan and Wang, Yuxuan},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/dong2024iclr-polyvoice/}
}