ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models
Abstract
Multi-turn visual conversation is an important capability of real-world AI assistants, yet existing benchmarks do not evaluate it. This paper presents ConvBench, a multi-turn conversation benchmark with hierarchical capability ablation for Large Vision-Language Models (LVLMs). ConvBench comprises 577 curated multi-turn conversations covering 215 tasks; the tasks are broad and open-ended, resembling real-world user behavior. Each conversation progressively examines an LVLM's perception, reasoning, and creativity, and the evaluation decouples these capabilities, enabling reliable error attribution. To handle the diversity of open-ended questions, we introduce an efficient and reliable automatic evaluation framework. Experimental results show that ConvBench poses a significant challenge for current LVLMs; even GPT-4V achieves only a 39.51% score. We also report several findings, for example, that LVLMs' weak perception suppresses their true strength in reasoning and creation. We believe the design of hierarchical capabilities, decoupled capability evaluation, and multi-turn conversation opens a new direction for LVLM evaluation. Code and benchmark are released at https://github.com/shirlyliu64/ConvBench.
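To make the idea of decoupled, per-turn capability scoring concrete, here is a minimal sketch in Python. It assumes each conversation yields one judge score per turn (perception, reasoning, creativity) and averages each capability separately so failures can be attributed to a specific level; the class and function names are illustrative and not ConvBench's actual API or scoring rule.

```python
# Hypothetical sketch of hierarchical, decoupled scoring: one score per turn
# (perception -> reasoning -> creativity), aggregated per capability so that
# errors can be attributed to the level where they occur.
from dataclasses import dataclass
from statistics import mean


@dataclass
class ConversationScores:
    perception: float   # turn 1: describing / perceiving the image
    reasoning: float    # turn 2: reasoning over what was perceived
    creativity: float   # turn 3: open-ended creation grounded in the above


def aggregate(convs: list[ConversationScores]) -> dict[str, float]:
    """Average each capability independently across conversations."""
    return {
        "perception": mean(c.perception for c in convs),
        "reasoning": mean(c.reasoning for c in convs),
        "creativity": mean(c.creativity for c in convs),
        # an overall score could combine the three, e.g. a plain mean
        "overall": mean(
            mean([c.perception, c.reasoning, c.creativity]) for c in convs
        ),
    }


# Example with made-up judge scores in [0, 1]:
print(aggregate([
    ConversationScores(0.70, 0.50, 0.40),
    ConversationScores(0.60, 0.45, 0.35),
]))
```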
Cite
Text
Liu et al. "ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-3195
Markdown
[Liu et al. "ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/liu2024neurips-convbench/) doi:10.52202/079017-3195
BibTeX
@inproceedings{liu2024neurips-convbench,
title = {{ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models}},
author = {Liu, Shuo and Ying, Kaining and Zhang, Hao and Yang, Yue and Lin, Yuqi and Zhang, Tianle and Li, Chuanhao and Qiao, Yu and Luo, Ping and Shao, Wenqi and Zhang, Kaipeng},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3195},
url = {https://mlanthology.org/neurips/2024/liu2024neurips-convbench/}
}