Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Abstract
The advancement of large language models (LLMs) has prompted the development of multi-modal agents, offering a feasible way to solve practical tasks through tool usage. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To ensure data quality, we prompt the GPT-4o model to separately generate queries, files, and trajectories, and then filter them with a query-file verifier and a trajectory verifier. Based on this data synthesis pipeline, we collect the MM-Traj dataset with 20k tasks using 10 tools. We then build the T3-Agent (Trajectory Tuning for Tool usage) by tuning MiniCPM-V as the controller on MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent achieves remarkable improvements and outperforms GPT-4-driven agents by 10%, demonstrating the effectiveness of the proposed data synthesis pipeline in producing better tool-usage reasoning capabilities.
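The abstract describes a two-stage verified synthesis loop: GPT-4o drafts queries and files, a query-file verifier filters mismatched pairs, GPT-4o then produces a tool-call trajectory, and a trajectory verifier filters unsound solutions. The sketch below illustrates this filtering flow only at a schematic level; all function names (`generate_query_and_files`, `query_file_verifier`, `generate_trajectory`, `trajectory_verifier`) and data fields are hypothetical placeholders, not the authors' actual pipeline or API.

```python
# Minimal, hypothetical sketch of the verified data-synthesis loop described
# in the abstract. All names are illustrative placeholders; real generation
# would call GPT-4o instead of the stub functions used here.

from dataclasses import dataclass, field


@dataclass
class ToolUsageTask:
    query: str                 # natural-language multi-modal query
    files: list[str]           # paths to associated images/documents
    trajectory: list[dict] = field(default_factory=list)  # tool-call steps


def generate_query_and_files(seed: str) -> ToolUsageTask:
    """Placeholder for prompting GPT-4o to draft a query and its files."""
    return ToolUsageTask(query=f"Describe the chart in {seed}",
                         files=[f"{seed}.png"])


def query_file_verifier(task: ToolUsageTask) -> bool:
    """Placeholder check that the query is answerable from the files."""
    return bool(task.query) and bool(task.files)


def generate_trajectory(task: ToolUsageTask) -> list[dict]:
    """Placeholder for prompting GPT-4o to produce a tool-call trajectory."""
    return [{"tool": "image_qa", "args": {"file": task.files[0]}}]


def trajectory_verifier(task: ToolUsageTask) -> bool:
    """Placeholder check that the tool calls actually resolve the query."""
    return len(task.trajectory) > 0


def synthesize_dataset(seeds: list[str]) -> list[ToolUsageTask]:
    """Keep only tasks that pass both verifiers, mirroring the pipeline's filtering."""
    dataset = []
    for seed in seeds:
        task = generate_query_and_files(seed)
        if not query_file_verifier(task):
            continue  # drop query-file pairs that do not match
        task.trajectory = generate_trajectory(task)
        if trajectory_verifier(task):
            dataset.append(task)
    return dataset


if __name__ == "__main__":
    print(synthesize_dataset(["sales_report", "weather_map"]))
```

The surviving tasks would then serve as supervised trajectory-tuning data for the VLM controller (MiniCPM-V in the paper's setup).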
Cite
Text
Gao et al. "Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.
Markdown
[Gao et al. "Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/gao2025iclrw-multimodal/)
BibTeX
@inproceedings{gao2025iclrw-multimodal,
  title     = {{Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage}},
  author    = {Gao, Zhi and Zhang, Bofei and Li, Pengxiang and Ma, Xiaojian and Yuan, Tao and Fan, Yue and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing},
  booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/gao2025iclrw-multimodal/}
}