Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Abstract
The advancement of large language models (LLMs) has prompted the development of multi-modal agents, which use LLMs as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To ensure data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, and then verify them with query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming untrained VLMs by 20%. This demonstrates the effectiveness of the proposed data synthesis pipeline in producing high-quality data for tool-usage capabilities.
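The abstract describes a generate-then-verify pipeline: a model such as GPT-4o mini proposes (query, files, trajectory) candidates, and only candidates accepted by both a query-file verifier and a trajectory verifier are kept for trajectory tuning. The sketch below is a minimal illustration of that structure, not the authors' implementation; all function and class names (`TaskSample`, `synthesize_samples`, `filter_samples`, the toy generator and verifiers) are hypothetical placeholders.

```python
# Minimal sketch (assumed, not from the paper's codebase) of a two-stage
# generate-then-verify data synthesis loop for tool-usage trajectories.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskSample:
    query: str            # natural-language task, e.g. "Count the red cars in street.jpg"
    files: List[str]      # associated multi-modal files (images, tables, documents, ...)
    trajectory: str       # step-by-step tool calls and observations that solve the query


def synthesize_samples(generate: Callable[[], TaskSample], n: int) -> List[TaskSample]:
    """Draw n candidate samples from a generator backed by, e.g., GPT-4o mini."""
    return [generate() for _ in range(n)]


def filter_samples(
    samples: List[TaskSample],
    query_file_ok: Callable[[TaskSample], bool],
    trajectory_ok: Callable[[TaskSample], bool],
) -> List[TaskSample]:
    """Keep only samples that pass both verifiers, mirroring the query-file
    and trajectory verification stages described in the abstract."""
    return [s for s in samples if query_file_ok(s) and trajectory_ok(s)]


if __name__ == "__main__":
    # Toy stand-ins for the generator and the two verifiers.
    def toy_generate() -> TaskSample:
        return TaskSample(
            query="Count the red cars in street.jpg",
            files=["street.jpg"],
            trajectory="Thought: use detector -> Action: detect('street.jpg', 'red car') -> Answer: 3",
        )

    kept = filter_samples(
        synthesize_samples(toy_generate, 5),
        query_file_ok=lambda s: bool(s.files) and s.query.strip() != "",
        trajectory_ok=lambda s: "Answer:" in s.trajectory,
    )
    print(f"{len(kept)} samples kept for trajectory tuning")
```

In practice the verifiers would themselves be model-based checks (e.g., prompting an LLM to judge whether the query matches the files and whether the trajectory is executable and correct); the filtered set plays the role of MM-Traj for supervised trajectory tuning of the VLM controller.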
Cite
Text
Gao et al. "Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage." International Conference on Learning Representations, 2025.
Markdown
[Gao et al. "Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/gao2025iclr-multimodal/)
BibTeX
@inproceedings{gao2025iclr-multimodal,
  title     = {{Multi-Modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage}},
  author    = {Gao, Zhi and Zhang, Bofei and Li, Pengxiang and Ma, Xiaojian and Yuan, Tao and Fan, Yue and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/gao2025iclr-multimodal/}
}