Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Merrill, Mike A; Shaw, Alexander Glenn; Carlini, Nicholas; Li, Boxuan; Raj, Harsh; Bercovich, Ivan; Shi, Lin; Shin, Jeong Yeon; Walshe, Thomas; Buchanan, E. Kelly; Shen, Junhong; Ye, Guanghao; Lin, Haowei; Poulos, Jason; Wang, Maoyu; Nezhurina, Marianna; Lu, Di; Mastromichalakis, Orfeas Menis; Xu, Zhiwei; Chen, Zizhao; Liu, Yue; Zhang, Robert; Chen, Leon Liangyu; Kashyap, Anurag; Uslu, Jan-Lucas; Li, Jeffrey; Wu, Jianbo; Yan, Minghao; Bian, Song; Sharma, Vedang; Sun, Ke; Dillmann, Steven; Anand, Akshay; Lanpouthakoun, Andrew; Koopah, Bardia; Hu, Changran; Guha, Etash Kumar; Dreiman, Gabriel H. S.; Zhu, Jiacheng; Krauth, Karl; Zhong, Li; Muennighoff, Niklas; Amanfu, Robert Kwesi; Tan, Shangyin; Pimpalgaonkar, Shreyas; Aggarwal, Tushar; Lin, Xiangning; Lan, Xin; Zhao, Xuandong; Liang, Yiqing; Wang, Yuanli; Wang, Zilong; Zhou, Changzhi; Heineman, David; Liu, Hange; Trivedi, Harsh; Yang, John; Lin, Junhong; Shetty, Manish; Yang, Michael; Omi, Nabil; Raoof, Negin; Li, Shanda; Zhuo, Terry Yue; Lin, Wuwei; Dai, Yiwei; Wang, Yuxin; Chai, Wenhao; Zhou, Shang; Wahdany, Dariush; She, Ziyu; Hu, Jiaming; Dong, Zhikang; Zhu, Yuxuan; Cui, Sasha; Saiyed, Ahson; Kolbeinsson, Arinbjörn; Rytting, Christopher Michael; Marten, Ryan; Wang, Yixin; Jitsev, Jenia; Dimakis, Alex; Konwinski, Andy; Schmidt, Ludwig

ICLR 2026

Abstract

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not difficult enough to meaningfully measure frontier models. To address this gap, we present Terminal-Bench 2.0: a carefully curated, hard benchmark composed of 89 tasks in computer terminal environments, inspired by problems from real workflows. Each task features a unique environment, a human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark, and we conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness at tbench.ai to assist developers and researchers in future work.

Cite

Text

Merrill et al. "Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces." International Conference on Learning Representations, 2026.

Markdown

[Merrill et al. "Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/merrill2026iclr-terminalbench/)

BibTeX

@inproceedings{merrill2026iclr-terminalbench,
  title     = {{Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces}},
  author    = {Merrill, Mike A and Shaw, Alexander Glenn and Carlini, Nicholas and Li, Boxuan and Raj, Harsh and Bercovich, Ivan and Shi, Lin and Shin, Jeong Yeon and Walshe, Thomas and Buchanan, E. Kelly and Shen, Junhong and Ye, Guanghao and Lin, Haowei and Poulos, Jason and Wang, Maoyu and Nezhurina, Marianna and Lu, Di and Mastromichalakis, Orfeas Menis and Xu, Zhiwei and Chen, Zizhao and Liu, Yue and Zhang, Robert and Chen, Leon Liangyu and Kashyap, Anurag and Uslu, Jan-Lucas and Li, Jeffrey and Wu, Jianbo and Yan, Minghao and Bian, Song and Sharma, Vedang and Sun, Ke and Dillmann, Steven and Anand, Akshay and Lanpouthakoun, Andrew and Koopah, Bardia and Hu, Changran and Guha, Etash Kumar and Dreiman, Gabriel H. S. and Zhu, Jiacheng and Krauth, Karl and Zhong, Li and Muennighoff, Niklas and Amanfu, Robert Kwesi and Tan, Shangyin and Pimpalgaonkar, Shreyas and Aggarwal, Tushar and Lin, Xiangning and Lan, Xin and Zhao, Xuandong and Liang, Yiqing and Wang, Yuanli and Wang, Zilong and Zhou, Changzhi and Heineman, David and Liu, Hange and Trivedi, Harsh and Yang, John and Lin, Junhong and Shetty, Manish and Yang, Michael and Omi, Nabil and Raoof, Negin and Li, Shanda and Zhuo, Terry Yue and Lin, Wuwei and Dai, Yiwei and Wang, Yuxin and Chai, Wenhao and Zhou, Shang and Wahdany, Dariush and She, Ziyu and Hu, Jiaming and Dong, Zhikang and Zhu, Yuxuan and Cui, Sasha and Saiyed, Ahson and Kolbeinsson, Arinbjörn and Rytting, Christopher Michael and Marten, Ryan and Wang, Yixin and Jitsev, Jenia and Dimakis, Alex and Konwinski, Andy and Schmidt, Ludwig},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/merrill2026iclr-terminalbench/}
}