Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Abstract
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at tbench.ai.
Cite
Text
Merrill et al. "Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces." International Conference on Learning Representations, 2026.
Markdown
[Merrill et al. "Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/merrill2026iclr-terminalbench/)
BibTeX
@inproceedings{merrill2026iclr-terminalbench,
title = {{Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces}},
author = {Merrill, Mike A and Shaw, Alexander Glenn and Carlini, Nicholas and Li, Boxuan and Raj, Harsh and Bercovich, Ivan and Shi, Lin and Shin, Jeong Yeon and Walshe, Thomas and Buchanan, E. Kelly and Shen, Junhong and Ye, Guanghao and Lin, Haowei and Poulos, Jason and Wang, Maoyu and Nezhurina, Marianna and Lu, Di and Mastromichalakis, Orfeas Menis and Xu, Zhiwei and Chen, Zizhao and Liu, Yue and Zhang, Robert and Chen, Leon Liangyu and Kashyap, Anurag and Uslu, Jan-Lucas and Li, Jeffrey and Wu, Jianbo and Yan, Minghao and Bian, Song and Sharma, Vedang and Sun, Ke and Dillmann, Steven and Anand, Akshay and Lanpouthakoun, Andrew and Koopah, Bardia and Hu, Changran and Guha, Etash Kumar and Dreiman, Gabriel H. S. and Zhu, Jiacheng and Krauth, Karl and Zhong, Li and Muennighoff, Niklas and Amanfu, Robert Kwesi and Tan, Shangyin and Pimpalgaonkar, Shreyas and Aggarwal, Tushar and Lin, Xiangning and Lan, Xin and Zhao, Xuandong and Liang, Yiqing and Wang, Yuanli and Wang, Zilong and Zhou, Changzhi and Heineman, David and Liu, Hange and Trivedi, Harsh and Yang, John and Lin, Junhong and Shetty, Manish and Yang, Michael and Omi, Nabil and Raoof, Negin and Li, Shanda and Zhuo, Terry Yue and Lin, Wuwei and Dai, Yiwei and Wang, Yuxin and Chai, Wenhao and Zhou, Shang and Wahdany, Dariush and She, Ziyu and Hu, Jiaming and Dong, Zhikang and Zhu, Yuxuan and Cui, Sasha and Saiyed, Ahson and Kolbeinsson, Arinbjörn and Rytting, Christopher Michael and Marten, Ryan and Wang, Yixin and Jitsev, Jenia and Dimakis, Alex and Konwinski, Andy and Schmidt, Ludwig},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/merrill2026iclr-terminalbench/}
}