VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Abstract
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents that are postulated to excel across a myriad of tasks. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs as visual foundation agents in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and unified benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios in one standard setting, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing of 9 proprietary LMM APIs and 9 open models (18 in total), we demonstrate the considerable yet still developing visual agent capabilities of these models. Additionally, VAB explores the synthesis of visual agent trajectory data through hybrid methods, including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, offering insights into the obstacles, solutions, and trade-offs one may encounter in developing open LMM agents. Our work not only benchmarks existing models but also provides an instrumental playground for future development of visual foundation agents. Code, training data, and test data are available at https://github.com/THUDM/VisualAgentBench.
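To make the agent-environment interplay described in the abstract concrete, below is a minimal, self-contained sketch of the generic observe-act-record loop that visual agent benchmarks of this kind evaluate and from which agent trajectories are collected. All names here (Step, Trajectory, DummyEnv, DummyAgent, run_episode) are hypothetical illustrations, not part of the VisualAgentBench API; consult the repository above for the actual interfaces.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch only: a generic loop in which an LMM-backed agent
# receives a visual observation, emits a textual action, and the
# (observation, action) pairs are recorded as a trajectory.

@dataclass
class Step:
    observation: str   # e.g. a path to a screenshot of the GUI or embodied scene
    action: str        # the agent's textual action, e.g. "CLICK(120, 340)"

@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    success: bool = False

class DummyEnv:
    """Stand-in environment that ends after a fixed number of steps."""
    def __init__(self, task: str, horizon: int = 3):
        self.task, self.horizon, self.t = task, horizon, 0

    def observe(self) -> str:
        return f"screenshot_step_{self.t}.png"

    def step(self, action: str) -> bool:
        self.t += 1
        return self.t >= self.horizon  # True when the episode is over

class DummyAgent:
    """Stand-in for an LMM: maps (task, observation) to an action string."""
    def act(self, task: str, observation: str) -> str:
        return f"ANSWER[{task} @ {observation}]"

def run_episode(agent: DummyAgent, env: DummyEnv) -> Trajectory:
    traj = Trajectory(task=env.task)
    done = False
    while not done:
        obs = env.observe()
        action = agent.act(env.task, obs)
        traj.steps.append(Step(obs, action))
        done = env.step(action)
    traj.success = True  # a real benchmark would score against task-specific criteria
    return traj

if __name__ == "__main__":
    traj = run_episode(DummyAgent(), DummyEnv("open the settings menu"))
    print(len(traj.steps), traj.steps[0])
```

In a setting like the one the paper describes, trajectories of this shape could come from program-based solvers, from bootstrapping a capable LMM agent, or from human demonstrations, and would then serve as training data for open models.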
Cite
Text
Liu et al. "VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents." International Conference on Learning Representations, 2025.
Markdown
[Liu et al. "VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/liu2025iclr-visualagentbench/)
BibTeX
@inproceedings{liu2025iclr-visualagentbench,
title = {{VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents}},
author = {Liu, Xiao and Zhang, Tianjie and Gu, Yu and Iong, Iat Long and Song, XiXuan and Xu, Yifan and Zhang, Shudan and Lai, Hanyu and Sun, Jiadai and Yang, Xinyue and Yang, Yu and Qi, Zehan and Yao, Shuntian and Sun, Xueqiao and Cheng, Siyi and Zheng, Qinkai and Yu, Hao and Zhang, Hanchen and Hong, Wenyi and Ding, Ming and Pan, Lihang and Gu, Xiaotao and Zeng, Aohan and Du, Zhengxiao and Song, Chan Hee and Su, Yu and Dong, Yuxiao and Tang, Jie},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/liu2025iclr-visualagentbench/}
}