Scaling Speech Tokenizers with Diffusion Autoencoders

Abstract

Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving both a low bit rate and a low token rate. We propose the Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
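
The two rates quoted in the abstract fix the information carried by each token: bit rate divided by token rate gives bits per token. The sketch below works through that arithmetic. The function name `bits_per_token` and the single-codebook reading are illustrative assumptions; the abstract does not state how SiTok's quantizer is configured.

```python
from fractions import Fraction


def bits_per_token(bit_rate_bps, token_rate_hz):
    """Bits carried by each token: (bits / second) / (tokens / second)."""
    # Parse via strings so 12.5 is represented exactly rather than as a float.
    return Fraction(str(bit_rate_bps)) / Fraction(str(token_rate_hz))


# Figures taken from the abstract.
bpt = bits_per_token(200, 12.5)

# ASSUMPTION (not stated in the abstract): if every token were drawn from a
# single codebook, that codebook would need 2**bits entries.
codebook_size = 2 ** int(bpt)

# Token and bit budget for a hypothetical 10-second utterance.
tokens_in_10s = Fraction("12.5") * 10
bits_in_10s = 200 * 10

print(f"bits per token: {bpt}")
print(f"implied single-codebook size: {codebook_size}")
print(f"10 s of speech: {tokens_in_10s} tokens, {bits_in_10s} bits")
```

Under these assumptions each token carries 16 bits. Other quantizer layouts, such as several smaller codebooks per frame, would reach the same 200 bits per second with different codebook sizes.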

Cite

Text

Wang et al. "Scaling Speech Tokenizers with Diffusion Autoencoders." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "Scaling Speech Tokenizers with Diffusion Autoencoders." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-scaling/)

BibTeX

@inproceedings{wang2026iclr-scaling,
  title     = {{Scaling Speech Tokenizers with Diffusion Autoencoders}},
  author    = {Wang, Yuancheng and Tang, Zhenyu and Wang, Yun and Hinsvark, Arthur and Liu, Yingru and Li, Yinghao Aaron and Peng, Kainan and Ao, Junyi and Ma, Mingbo and Seltzer, Mike and He, Qing and Liu, Xubo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-scaling/}
}