Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Abstract

The advancement of large language models (LLMs) prompts the development of multi-modal agents, providing a feasible way to solve practical tasks by using tools. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve data quality, we prompt the GPT-4o model to separately generate queries, files, and trajectories, followed by a query-file verifier and a trajectory verifier. Based on this data synthesis pipeline, we collect the MM-Traj dataset with 20k tasks using 10 tools. Then, we build the T3-agent, which uses MiniCPM-V as the controller and is obtained via Trajectory Tuning for Tool usage on MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-agent achieves remarkable improvements and outperforms GPT-4-driven agents by 10%, demonstrating the effectiveness of the proposed data synthesis pipeline, which leads to better reasoning capabilities in tool usage.
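The abstract describes a generate-then-verify pipeline: GPT-4o produces queries, files, and tool-usage trajectories in separate stages, and two verifiers filter the results. The sketch below is only an illustration of that control flow under assumptions; the function and class names (`generate_query`, `query_file_verifier`, `ToolUsageTask`, etc.) are hypothetical placeholders, not the authors' actual implementation or prompts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolUsageTask:
    query: str        # natural-language task description
    files: list[str]  # paths of the multi-modal files attached to the query
    trajectory: str   # step-by-step tool calls and observations that solve the query

# Placeholder generators: in the paper's pipeline these correspond to separate
# GPT-4o prompting stages for queries, files, and trajectories (assumed interface).
def generate_query() -> str: ...
def generate_files(query: str) -> list[str]: ...
def generate_trajectory(query: str, files: list[str]) -> str: ...

# Placeholder verifiers: the query-file verifier checks that the files actually
# match the query; the trajectory verifier checks the tool-usage trajectory.
def query_file_verifier(query: str, files: list[str]) -> bool: ...
def trajectory_verifier(query: str, files: list[str], trajectory: str) -> bool: ...

def synthesize_one() -> Optional[ToolUsageTask]:
    """One pass of the generate-then-verify loop; returns None if rejected."""
    query = generate_query()
    files = generate_files(query)
    if not query_file_verifier(query, files):
        return None  # discard tasks whose files do not support the query
    trajectory = generate_trajectory(query, files)
    if not trajectory_verifier(query, files, trajectory):
        return None  # discard low-quality or invalid tool-usage trajectories
    return ToolUsageTask(query, files, trajectory)
```

Repeating this loop until a target count is reached would yield a filtered dataset of (query, files, trajectory) triples in the spirit of MM-Traj; the exact prompts, tools, and verification criteria are detailed in the paper itself.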

Cite

Text

Gao et al. "Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.

Markdown

[Gao et al. "Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/gao2025iclrw-multimodal/)

BibTeX

@inproceedings{gao2025iclrw-multimodal,
  title     = {{Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage}},
  author    = {Gao, Zhi and Zhang, Bofei and Li, Pengxiang and Ma, Xiaojian and Yuan, Tao and Fan, Yue and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing},
  booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/gao2025iclrw-multimodal/}
}