TaDiCodec: Text-Aware Diffusion Speech Tokenizer for Speech Language Modeling

Abstract

Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including: (1) dependence on multi-layer residual vector quantization structures or high frame rates, (2) reliance on auxiliary pre-trained models for semantic distillation, and (3) requirements for complex two-stage training processes. In this work, we introduce the **T**ext-**a**ware **Di**ffusion Transformer Speech **Codec** (***TaDiCodec***), a novel approach designed to overcome these challenges. TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of **6.25 Hz** and a corresponding bitrate of **0.0875 kbps** with a **single-layer codebook** for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec in language-model-based zero-shot text-to-speech with both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a significantly small *reconstruction-generation gap*. To facilitate reproducibility and further research, we release our source code and pre-trained checkpoints at https://github.com/AmphionTeam/TaDiCodec. Audio samples are available at https://tadicodec.github.io/.
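The quoted compression figures are internally consistent: at 6.25 tokens per second, 0.0875 kbps works out to 14 bits per token, which implies a single codebook of 2^14 = 16384 entries (the codebook size is an inference from the stated numbers, not a figure given in this abstract). A minimal sanity check:

```python
# Back-of-envelope check of the frame-rate / bitrate figures quoted above.
# Assumption: a single codebook with 16384 (= 2**14) entries, which is what
# the stated 6.25 Hz and 0.0875 kbps jointly imply.
import math

frame_rate_hz = 6.25            # tokens emitted per second of speech
codebook_size = 16_384          # assumed single-layer codebook size

bits_per_token = math.log2(codebook_size)             # 14.0 bits
bitrate_kbps = frame_rate_hz * bits_per_token / 1000  # 0.0875 kbps

# Each token therefore covers 24_000 / 6.25 = 3840 samples of 24 kHz audio.
samples_per_token = 24_000 / frame_rate_hz

print(bits_per_token, bitrate_kbps, samples_per_token)
```

This also illustrates why such a low frame rate is attractive for speech language modeling: a 10-second utterance is represented by only about 63 discrete tokens.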

Cite

Text

Wang et al. "TaDiCodec: Text-Aware Diffusion Speech Tokenizer for Speech Language Modeling." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "TaDiCodec: Text-Aware Diffusion Speech Tokenizer for Speech Language Modeling." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-tadicodec/)

BibTeX

@inproceedings{wang2025neurips-tadicodec,
  title     = {{TaDiCodec: Text-Aware Diffusion Speech Tokenizer for Speech Language Modeling}},
  author    = {Wang, Yuancheng and Chen, Dekun and Zhang, Xueyao and Zhang, Junan and Li, Jiaqi and Wu, Zhizheng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-tadicodec/}
}