DMT-RoleBench: A Dynamic Multi-Turn Dialogue Based Benchmark for Role-Playing Evaluation of Large Language Model and Agent
Abstract
Recent years have witnessed a profound evolution in the abilities of Large Language Models, which has significantly boosted the proliferation of role-playing agents and platforms. Nonetheless, there is a conspicuous absence of systematic and comprehensive evaluations of role-playing abilities that are truly aligned with users' real-world interaction scenarios. To address this gap, we have devised DMT-RoleBench, a benchmark designed to evaluate the role-playing abilities of large language models and agents based on dynamic multi-turn dialogues. Compared with existing role-playing benchmarks, DMT-RoleBench offers several principal advantages: (1) It contains more diverse role types and system prompts in a variety of formats. (2) We propose an innovative evaluation paradigm that assesses role-playing abilities by dynamically generating multi-turn dialogues constrained by specific evaluation intents and topics, which is well aligned with users' real-world interaction scenarios. (3) We define a three-tiered metric system and provide DMT-RM, a reward model aligned with human annotations, to annotate the dialogues, and we propose DMT-Score to compute final scores from the annotated dialogues. Our experiments on and analysis of leading models with role-playing abilities demonstrate the effectiveness of DMT-RoleBench.
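The abstract's evaluation paradigm can be pictured as a loop: a simulator produces user turns constrained by an evaluation intent and topic, the model under test replies in character, a reward model annotates each turn, and the annotations are aggregated into a score. The sketch below is a hypothetical outline of that loop, not the authors' implementation: the function names, interfaces, and the simple-mean aggregation are assumptions for illustration, and the actual dialogue generation, DMT-RM annotation, and DMT-Score computation described in the paper differ in detail.

```python
# Illustrative sketch of a dynamic multi-turn role-playing evaluation loop.
# All names and the aggregation rule are assumptions; the paper's DMT-RM
# annotation and DMT-Score computation are more elaborate.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DialogueTurn:
    user: str                      # user utterance produced by the dialogue simulator
    assistant: str                 # in-character reply from the model under test
    scores: Dict[str, float] = field(default_factory=dict)  # per-metric annotations


def run_dynamic_dialogue(
    system_prompt: str,
    intent: str,
    topic: str,
    num_turns: int,
    simulate_user: Callable[[str, str, List[DialogueTurn]], str],
    target_model: Callable[[str, List[DialogueTurn], str], str],
    reward_model: Callable[[str, DialogueTurn], Dict[str, float]],
) -> List[DialogueTurn]:
    """Generate a dialogue constrained by an evaluation intent and topic,
    then annotate each turn with per-metric scores (hypothetical interface)."""
    history: List[DialogueTurn] = []
    for _ in range(num_turns):
        user_msg = simulate_user(intent, topic, history)        # dynamic user turn
        reply = target_model(system_prompt, history, user_msg)  # role-played reply
        turn = DialogueTurn(user=user_msg, assistant=reply)
        turn.scores = reward_model(system_prompt, turn)         # reward-model annotation
        history.append(turn)
    return history


def aggregate_score(dialogue: List[DialogueTurn]) -> float:
    """Toy aggregation: a plain mean over all metrics and turns. The real
    DMT-Score combines the three metric tiers in a more structured way."""
    values = [v for turn in dialogue for v in turn.scores.values()]
    return sum(values) / len(values) if values else 0.0
```

In practice the simulator, the target model, and the reward model would all be LLM-backed; the callable parameters above only fix a plausible interface for the loop.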
Cite
Text
Yuan et al. "DMT-RoleBench: A Dynamic Multi-Turn Dialogue Based Benchmark for Role-Playing Evaluation of Large Language Model and Agent." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I24.34768
Markdown
[Yuan et al. "DMT-RoleBench: A Dynamic Multi-Turn Dialogue Based Benchmark for Role-Playing Evaluation of Large Language Model and Agent." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/yuan2025aaai-dmt/) doi:10.1609/AAAI.V39I24.34768
BibTeX
@inproceedings{yuan2025aaai-dmt,
title = {{DMT-RoleBench: A Dynamic Multi-Turn Dialogue Based Benchmark for Role-Playing Evaluation of Large Language Model and Agent}},
author = {Yuan, Dingbo and Chen, Yipeng and Liu, Guodong and Li, Chenchen and Tang, Chengfu and Zhang, Dongxu and Wang, Zhenkui and Wang, Xudong and Liu, Song},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {25760-25768},
doi = {10.1609/AAAI.V39I24.34768},
url = {https://mlanthology.org/aaai/2025/yuan2025aaai-dmt/}
}