DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Tan, Chao-Hong; Chen, Qian; Wang, Wen; Deng, Chong; Zhang, Qinglin; Cheng, Luyao; Yu, Hai; Zhang, Xin; Lyu, Xiang; Zhao, Tianyu; Zhang, Chong; Ma, Yukun; Chen, Yafeng; Wang, Hui; Liu, Jiaqing; Li, Xiangang; Ye, Jieping

ICLR 2026

Abstract

Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches fall primarily into two categories: (1) methods that generate discrete speech tokens independently, without incorporating them into the LLM’s autoregressive process, so that text generation is unaware of the concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling and featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5Hz input audio representation, the proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz. This significantly reduces computational cost and alleviates the frequency discrepancy between speech and text tokens, which in turn better exploits the LLM’s capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on prominent speech benchmarks, including OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio, making it a leading open-source speech foundation model among ∼7B models.
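To make the frame-rate claim in the abstract concrete, the following is a minimal arithmetic sketch, not the authors' implementation. It only counts how many speech-representation frames an LLM would read at the two rates quoted above (12.5Hz and 5Hz). The helper name `frames_for` and the 10-second utterance length are hypothetical, chosen for illustration; how DrVoice actually produces the 5Hz representation is not described here.

```python
def frames_for(duration_s: float, rate_hz: float) -> int:
    """Count input frames for an utterance at a given frame rate.

    Hypothetical helper for illustration only; it does not model how
    any particular system computes its speech representations.
    """
    return round(duration_s * rate_hz)


duration_s = 10.0  # assumed utterance length, not taken from the paper

baseline_frames = frames_for(duration_s, 12.5)  # rate attributed to current methods
reduced_frames = frames_for(duration_s, 5.0)    # rate reported for DrVoice's LLM input

# Fraction of LLM input positions saved by the lower rate.
saving = 1 - reduced_frames / baseline_frames

print(baseline_frames, reduced_frames, f"{saving:.0%}")
```

Under these assumptions, the same utterance occupies 125 input positions at 12.5Hz and 50 at 5Hz, a 60% reduction in speech positions the LLM must attend over.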

Cite

Text

Tan et al. "DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations." International Conference on Learning Representations, 2026.

Markdown

[Tan et al. "DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/tan2026iclr-drvoice/)

BibTeX

@inproceedings{tan2026iclr-drvoice,
  title     = {{DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations}},
  author    = {Tan, Chao-Hong and Chen, Qian and Wang, Wen and Deng, Chong and Zhang, Qinglin and Cheng, Luyao and Yu, Hai and Zhang, Xin and Lyu, Xiang and Zhao, Tianyu and Zhang, Chong and Ma, Yukun and Chen, Yafeng and Wang, Hui and Liu, Jiaqing and Li, Xiangang and Ye, Jieping},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/tan2026iclr-drvoice/}
}