ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

Kokane, Shirley; Zhu, Ming; Awalgaonkar, Tulika Manoj; Zhang, Jianguo; Prabhakar, Akshara; Hoang, Thai Quoc; Liu, Zuxin; Rithesh, R N; Yang, Liangwei; Yao, Weiran; Tan, Juntao; Liu, Zhiwei; Wang, Huan; Niebles, Juan Carlos; Heinecke, Shelby; Xiong, Caiming; Savarese, Silvio

ICLRW 2025

Abstract

Evaluating Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Because LLM outputs propagate to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While several benchmark environments exist for evaluating LLMs on this task, they typically report only a success rate, with no explanation of the failure cases. To solve this problem, we introduce ToolScan, a new benchmark for identifying error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using ToolScan, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from ToolScan to guide their error mitigation strategies. We open-source our evaluation framework at https://anonymous.4open.science/r/ToolScan-1474 .

Cite

Text

Kokane et al. "ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Kokane et al. "ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/kokane2025iclrw-toolscan/)

BibTeX

@inproceedings{kokane2025iclrw-toolscan,
  title     = {{ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs}},
  author    = {Kokane, Shirley and Zhu, Ming and Awalgaonkar, Tulika Manoj and Zhang, Jianguo and Prabhakar, Akshara and Hoang, Thai Quoc and Liu, Zuxin and Rithesh, R N and Yang, Liangwei and Yao, Weiran and Tan, Juntao and Liu, Zhiwei and Wang, Huan and Niebles, Juan Carlos and Heinecke, Shelby and Xiong, Caiming and Savarese, Silvio},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/kokane2025iclrw-toolscan/}
}