MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Wang, Yuancheng; Zhan, Haoyue; Liu, Liwei; Zeng, Ruihong; Guo, Haotian; Zheng, Jiachen; Zhang, Qiang; Zhang, Xueyao; Zhang, Shunsi; Wu, Zhizheng

ICLR 2025

Abstract

Recent large-scale text-to-speech (TTS) systems are usually grouped into autoregressive and non-autoregressive systems. Autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g., phones), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in parallel. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available at https://maskgct.github.io/. We release our code and model checkpoints at https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct.
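
To make the mask-and-predict inference described in the abstract more concrete, the following is a minimal sketch of iterative parallel decoding for a token sequence of a specified length. It is not the authors' implementation. The `MASK` id, the `toy_predictor` stand-in (which returns random tokens and confidences in place of a trained model), the cosine unmasking schedule, and the confidence-based commit rule are all assumptions borrowed from common masked generative decoding practice; the abstract confirms none of them.

```python
import math
import random

MASK = -1  # hypothetical id marking a position that is still masked


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model (assumption): one (token, confidence) per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict_decode(length, vocab_size, steps=4, seed=0):
    """Fill `length` masked slots over at most `steps` parallel refinement rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # All positions are predicted at once; only some predictions are kept.
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this round.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(mask_predict_decode(length=10, vocab_size=8))
```

In this sketch the target length is an input rather than something predicted per phone, which mirrors the abstract's statement that tokens of a specified length are generated in parallel. In a two-stage setup like the one described, the same loop shape could in principle be run once for semantic tokens and again for acoustic tokens, each with its own conditioning.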

Cite

Text

Wang et al. "MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer." International Conference on Learning Representations, 2025.

Markdown

[Wang et al. "MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/wang2025iclr-maskgct/)

BibTeX

@inproceedings{wang2025iclr-maskgct,
  title     = {{MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer}},
  author    = {Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/wang2025iclr-maskgct/}
}