DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Abstract

Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches fall primarily into two categories: (1) methods that generate discrete speech tokens independently, without incorporating them into the LLM’s autoregressive process, so that text generation is unaware of the concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling and featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz. This significantly reduces computational cost and alleviates the frequency discrepancy between speech and text tokens, which in turn better exploits the LLM’s capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on prominent speech benchmarks, including OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio, making it a leading open-source speech foundation model among ∼7B models.
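To make the frame-rate comparison concrete, the following minimal sketch counts how many audio positions an LLM would see for one clip at the two rates named in the abstract. It is only back-of-the-envelope arithmetic, not the DrVoice implementation: the helper `frames`, the 30-second clip length, and the quadratic-attention estimate are illustrative assumptions, and text tokens and all other costs are ignored.

```python
import math


def frames(duration_s: float, rate_hz: float) -> int:
    """Number of audio positions needed to cover a clip at a given frame rate."""
    return math.ceil(duration_s * rate_hz)


# Rates taken from the abstract.
baseline_hz = 12.5  # input rate the abstract attributes to current methods
reduced_hz = 5.0    # LLM input rate the abstract reports for DrVoice

# Hypothetical clip length, chosen only for illustration.
clip_s = 30.0

n_base = frames(clip_s, baseline_hz)  # audio positions at 12.5Hz
n_red = frames(clip_s, reduced_hz)    # audio positions at 5Hz

# Ratio of sequence lengths seen by the LLM.
length_ratio = n_base / n_red

# Assumption: self-attention cost grows quadratically with sequence length,
# so the attention cost over audio positions shrinks by the squared ratio.
attn_ratio = length_ratio ** 2

print(f"{n_base} vs {n_red} positions, "
      f"{length_ratio:.2f}x shorter, ~{attn_ratio:.2f}x less attention cost")
```

Under these assumptions, the 30-second clip occupies 375 positions at 12.5Hz and 150 at 5Hz, a 2.5x shorter audio sequence; how the lower-rate representation is actually produced is not specified in the abstract.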

Cite

Text

Tan et al. "DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations." International Conference on Learning Representations, 2026.

BibTeX

@inproceedings{tan2026iclr-drvoice,
  title     = {{DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations}},
  author    = {Tan, Chao-Hong and Chen, Qian and Wang, Wen and Deng, Chong and Zhang, Qinglin and Cheng, Luyao and Yu, Hai and Zhang, Xin and Lyu, Xiang and Zhao, Tianyu and Zhang, Chong and Ma, Yukun and Chen, Yafeng and Wang, Hui and Liu, Jiaqing and Li, Xiangang and Ye, Jieping},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/tan2026iclr-drvoice/}
}