Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Abstract

Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks and making it the largest benchmark for speech and audio evaluation to date. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally: SALMONN-13B excelled in English ASR, and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovation to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.
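
As a rough illustration of the instruction-based evaluation setup described above, the sketch below loads a single benchmark task and scores a model with exact-match accuracy on a classification-style task. The repository ID, dataset split, field names (instruction, audio, label), and the model.generate interface are assumptions made for illustration only; the actual task list and evaluation pipeline are in the GitHub repository linked above.

# Minimal sketch (not the official pipeline): load one instruction-based task
# and compute exact-match accuracy. The repository ID, split, field names
# ("instruction", "audio", "label"), and the model.generate() interface are
# illustrative assumptions, not the benchmark's actual API.
from datasets import load_dataset

def evaluate_task(model, repo_id: str = "DynamicSuperb/ExampleTask") -> float:
    ds = load_dataset(repo_id, split="test")           # assumed split name
    correct = 0
    for example in ds:
        # Each example pairs an audio clip with a natural-language instruction.
        prediction = model.generate(
            audio=example["audio"]["array"],            # assumed audio field layout
            instruction=example["instruction"],         # assumed instruction field
        )
        correct += int(prediction.strip().lower() == example["label"].strip().lower())
    return correct / len(ds)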

Cite

Text

Huang et al. "Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks." International Conference on Learning Representations, 2025.

Markdown

[Huang et al. "Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/huang2025iclr-dynamicsuperb/)

BibTeX

@inproceedings{huang2025iclr-dynamicsuperb,
  title     = {{Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks}},
  author    = {Huang, Chien-yu and Chen, Wei-Chih and Yang, Shu-wen and Liu, Andy T. and Li, Chen-An and Lin, Yu-Xiang and Tseng, Wei-Cheng and Diwan, Anuj and Shih, Yi-Jen and Shi, Jiatong and Chen, William and Yang, Chih-Kai and Chen, Xuanjun and Hsiao, Chi-Yuan and Peng, Puyuan and Wang, Shih-Heng and Kuan, Chun-Yi and Lu, Ke-Han and Chang, Kai-Wei and Gutierrez, Fabian Alejandro Ritter and Huang, Kuan-Po and Arora, Siddhant and Lin, You-Kuan and To, Chuang Ming and Yeo, Eunjung and Chang, Kalvin and Chien, Chung-Ming and Choi, Kwanghee and Hsieh, Cheng-Hsiu and Lin, Yi-Cheng and Yu, Chee-En and Chiu, I-Hsiang and Guimarães, Heitor and Han, Jionghao and Lin, Tzu-Quan and Lin, Tzu-Yuan and Chang, Homu and Chang, Ting-Wu and Chen, Chun Wei and Chen, Shou-Jen and Chen, Yu-Hua and Cheng, Hsi-Chun and Dhawan, Kunal and Fang, Jia-Lin and Fang, Shi-Xin and Chiang, Kuan Yu Fang and Fu, Chi An and Hsiao, Hsien-Fu and Hsu, Ching Yu and Huang, Shao-Syuan and Wei, Lee Chen and Lin, Hsi-Che and Lin, Hsuan-Hao and Lin, Hsuan-Ting and Lin, Jian-Ren and Liu, Ting-Chun and Lu, Li-Chun and Pai, Tsung-Min and Pasad, Ankita and Kuan, Shih-Yun Shan and Shon, Suwon and Tang, Yuxun and Tsai, Yun-Shao and Chiang, Wei Jui and Wei, Tzu-Chieh and Wu, Chengxi and Wu, Dien-Ruei and Yang, Chao-Han Huck and Yang, Chieh-Chi and Yip, Jia Qi and Yuan, Shao-Xiang and Wu, Haibin and Livescu, Karen and Harwath, David and Watanabe, Shinji and Lee, Hung-yi},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/huang2025iclr-dynamicsuperb/}
}