Benchmarking LLM Tool-Use in the Wild

Yu, Peijie; Liu, Wei; Yang, Yifan; Li, Jinjian; Zhang, Zelong; Feng, Xiao; Zhang, Feng

ICLR 2026

Abstract

Fulfilling user needs through multi-turn, multi-step tool-use by Large Language Models (LLMs) is rarely a straightforward process. Real user interactions are inherently $\textbf{wild}$: intricate, messy, and flexible. We identify three key challenges arising from user behavior: $\textit{compositional tasks}$ that demand efficient orchestration of tool-call topologies; $\textit{implicit intent}$ spread across dialogue turns, which requires contextual inference; and $\textit{instruction transition}$, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce $\textbf{\textit{WildToolBench}}$, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15\%, indicating a substantial gap in the robustness of LLMs' agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks, but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among $\textit{LLMs}$, $\textit{users}$, and $\textit{tools}$.

Cite

Text

Yu et al. "Benchmarking LLM Tool-Use in the Wild." International Conference on Learning Representations, 2026.

Markdown

[Yu et al. "Benchmarking LLM Tool-Use in the Wild." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yu2026iclr-benchmarking/)

BibTeX

@inproceedings{yu2026iclr-benchmarking,
  title     = {{Benchmarking LLM Tool-Use in the Wild}},
  author    = {Yu, Peijie and Liu, Wei and Yang, Yifan and Li, Jinjian and Zhang, Zelong and Feng, Xiao and Zhang, Feng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yu2026iclr-benchmarking/}
}