ToolVQA: A Dataset for Multi-Step Reasoning VQA with External Tools
Abstract
Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K samples, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs image-guided Depth-First Search (DFS) with a Longest Common Subsequence (LCS)-based example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse domains, with an average inference length of 2.78 reasoning steps per sample. The LLaVA-7B model fine-tuned on ToolVQA not only achieves impressive performance on the ToolVQA test set, but also surpasses the large closed-source model GPT-3.5-turbo on five out-of-distribution (OOD) datasets, showing strong generalizability in real-world tool-use scenarios. Code is available at https://github.com/Fugtemypt123/ToolVQA-release.
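The abstract mentions an LCS-based example matching mechanism for retrieving in-context demonstrations. The paper's exact formulation is not given here, so the following is only a minimal illustrative sketch of how tool-call sequences could be compared with a classic dynamic-programming LCS, normalized into a similarity score; the function names and the normalization choice are assumptions, not the authors' implementation.

```python
def lcs_length(a, b):
    # Classic O(m*n) dynamic-programming longest common subsequence.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(seq_a, seq_b):
    # Normalize by the longer sequence so the score lies in [0, 1].
    # (Hypothetical choice; other normalizations are equally plausible.)
    if not seq_a or not seq_b:
        return 0.0
    return lcs_length(seq_a, seq_b) / max(len(seq_a), len(seq_b))

# Example: comparing two hypothetical tool-call traces.
trace_a = ["OCR", "Calculator"]
trace_b = ["OCR", "ImageCaption", "Calculator"]
score = lcs_similarity(trace_a, trace_b)  # LCS = 2, so 2/3
```

Under this sketch, candidate demonstrations whose tool-call traces score highest against the current partial trace would be selected as in-context examples.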
Cite
Text
Yin et al. "ToolVQA: A Dataset for Multi-Step Reasoning VQA with External Tools." International Conference on Computer Vision, 2025.
Markdown
[Yin et al. "ToolVQA: A Dataset for Multi-Step Reasoning VQA with External Tools." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/yin2025iccv-toolvqa/)
BibTeX
@inproceedings{yin2025iccv-toolvqa,
title = {{ToolVQA: A Dataset for Multi-Step Reasoning VQA with External Tools}},
author = {Yin, Shaofeng and Lei, Ting and Liu, Yang},
booktitle = {International Conference on Computer Vision},
year = {2025},
  pages = {4424--4433},
url = {https://mlanthology.org/iccv/2025/yin2025iccv-toolvqa/}
}