SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

Zhang, Xueyao; Wang, Chaoren; Liao, Huan; Li, Ziniu; Wang, Yuancheng; Wang, Li; Jia, Dongya; Chen, Yuanzhe; Li, Xiulin; Chen, Zhuo; Wu, Zhizheng

ICLR 2026

Abstract

Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this challenge is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce ***SpeechJudge***, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present ***SpeechJudge-Data***, a large-scale human feedback corpus of 99k speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish ***SpeechJudge-Eval***, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the best-performing model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, leaving significant room for improvement. To bridge this gap, we develop ***SpeechJudge-GRM***, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
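
The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard Bradley-Terry formulation over scalar rewards, measures agreement as the fraction of pairs where the predicted preference matches the human label, and treats "inference-time scaling @10" as a majority vote over 10 sampled judgments, which the abstract does not confirm. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that sample A is preferred over sample B,
    given one scalar reward per sample."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human-preferred sample under Bradley-Terry."""
    return -math.log(bt_preference_prob(score_chosen, score_rejected))


def majority_vote(votes: Sequence[str]) -> str:
    """Aggregate several sampled judgments (e.g. "A" or "B") into one label.
    Assumption: voting is how multiple samples are combined; ties go to the
    label that appeared first."""
    if not votes:
        raise ValueError("need at least one vote")
    return Counter(votes).most_common(1)[0][0]


def agreement(predictions: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of speech pairs where the predicted preference matches the human one."""
    if len(predictions) != len(human_labels) or not human_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from SpeechJudge-Data.
    print(bt_preference_prob(1.2, 0.4))
    print(majority_vote(["A", "B", "A"]))
    print(agreement(["A", "B", "B", "A"], ["A", "B", "A", "A"]))
```

A Bradley-Terry reward model outputs one scalar per utterance and compares the scalars, whereas a generative reward model emits a textual judgment for the pair, which is what makes sampling several judgments and voting possible.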

Cite

Text

Zhang et al. "SpeechJudge: Towards Human-Level Judgment for Speech Naturalness." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "SpeechJudge: Towards Human-Level Judgment for Speech Naturalness." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-speechjudge/)

BibTeX

@inproceedings{zhang2026iclr-speechjudge,
  title     = {{SpeechJudge: Towards Human-Level Judgment for Speech Naturalness}},
  author    = {Zhang, Xueyao and Wang, Chaoren and Liao, Huan and Li, Ziniu and Wang, Yuancheng and Wang, Li and Jia, Dongya and Chen, Yuanzhe and Li, Xiulin and Chen, Zhuo and Wu, Zhizheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-speechjudge/}
}