The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Li, Junlong; Zhao, Wenshuo; Zhao, Jian; Zeng, Weihao; Wu, Haoze; Wang, Xiaochen; Ge, Rui; Cao, Yuxuan; Huang, Yuzhen; Liu, Wei; Liu, Junteng; Su, Zhaochen; Guo, Yiyang; Zhou, Fan; Zhang, Lueyang; Michelini, Juan; Wang, Xingyao; Yue, Xiang; Zhou, Shuyan; Neubig, Graham; He, Junxian

ICLR 2026

Abstract

Real-world language agents must handle complex, multi-step workflows across diverse applications. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database such as BigQuery to detect anomalies and generate reports following a standard operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse applications and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional applications such as WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers, some of which we revised or implemented ourselves. Unlike prior work, which primarily ensures functional realism but offers limited diversity of environment states, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real-world financial spreadsheets. The Toolathlon benchmark includes 108 manually sourced or crafted tasks in total, each requiring interaction with multiple applications over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of state-of-the-art models highlights their significant shortcomings in performing real-world, long-horizon tasks: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool-calling turns on average, while the top open-weights model, DeepSeek-V3.2-Exp, reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

Cite

Text

Li et al. "The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-tool/)

BibTeX

@inproceedings{li2026iclr-tool,
  title     = {{The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution}},
  author    = {Li, Junlong and Zhao, Wenshuo and Zhao, Jian and Zeng, Weihao and Wu, Haoze and Wang, Xiaochen and Ge, Rui and Cao, Yuxuan and Huang, Yuzhen and Liu, Wei and Liu, Junteng and Su, Zhaochen and Guo, Yiyang and Zhou, Fan and Zhang, Lueyang and Michelini, Juan and Wang, Xingyao and Yue, Xiang and Zhou, Shuyan and Neubig, Graham and He, Junxian},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-tool/}
}