ITBench: Evaluating AI Agents Across Diverse Real-World IT Automation Tasks
Abstract
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents on real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 102 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 11.4% of SRE scenarios, 25.2% of CISO scenarios, and 25.8% of FinOps scenarios (excluding anomaly detection). For FinOps-specific anomaly detection (AD) scenarios, AI agents achieve an F1 score of 0.35. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast. ITBench, along with a leaderboard and sample agent implementations, is available at https://github.com/ibm/itbench.
Cite
Text
Jha et al. "ITBench: Evaluating AI Agents Across Diverse Real-World IT Automation Tasks." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Jha et al. "ITBench: Evaluating AI Agents Across Diverse Real-World IT Automation Tasks." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/jha2025icml-itbench/)
BibTeX
@inproceedings{jha2025icml-itbench,
title = {{ITBench: Evaluating AI Agents Across Diverse Real-World IT Automation Tasks}},
author = {Jha, Saurabh and Arora, Rohan R. and Watanabe, Yuji and Yanagawa, Takumi and Chen, Yinfang and Clark, Jackson and Bhavya, Bhavya and Verma, Mudit and Kumar, Harshit and Kitahara, Hirokuni and Zheutlin, Noah and Takano, Saki and Pathak, Divya and George, Felix and Wu, Xinbo and Turkkan, Bekir O and Vanloo, Gerard and Nidd, Michael and Dai, Ting and Chatterjee, Oishik and Gupta, Pranjal and Samanta, Suranjana and Aggarwal, Pooja and Lee, Rong and Ahn, Jae-Wook and Kar, Debanjana and Paradkar, Amit and Deng, Yu and Moogi, Pratibha and Mohapatra, Prateeti and Abe, Naoki and Narayanaswami, Chandrasekhar and Xu, Tianyin and Varshney, Lav R. and Mahindru, Ruchi and Sailer, Anca and Shwartz, Laura and Sow, Daby and Fuller, Nicholas C. M. and Puri, Ruchir},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {27134--27197},
volume = {267},
url = {https://mlanthology.org/icml/2025/jha2025icml-itbench/}
}