MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Wu, Zijian; Liu, Xiangyan; Zhang, Xinyuan; Chen, Lingjun; Meng, Fanqing; Du, Lingxiao; Zhao, Yiran; Zhang, Fanshi; Ye, Yaoqi; Wang, Jiawei; Wang, Zirui; Ni, Jinjie; Yang, Yufan; Xu, Arvin; Shieh, Michael Qizhe

ICLR 2026

Abstract

The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents, each with a curated initial state and a programmatic verification script. These tasks demand diverse create, read, update, and delete (CRUD) operations and richer environmental interactions. We evaluate cutting-edge LLMs using a minimal agent framework. The best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 turns and 17.4 tool calls per task, highlighting the stress-testing nature of MCPMark.
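
The abstract reports pass@1 and pass^4 without defining them. The following is a minimal sketch of one common reading, under stated assumptions: each task is attempted in several independent runs, pass@1 is the per-run success rate averaged over tasks, and pass^k counts a task as solved only if all of its first k runs succeed. The function names and the toy results are hypothetical and are not taken from the MCPMark codebase.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Per-run success rate of each task, averaged over all tasks (assumed definition)."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs ALL succeed (assumed definition of pass^k)."""
    for task, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
    solved = [all(runs[:k]) for runs in results.values()]
    return sum(solved) / len(solved)


# Hypothetical outcomes: three tasks, four independent runs each.
toy_results = {
    "task_a": [True, True, True, True],      # always solved
    "task_b": [True, False, True, False],    # solved in half of the runs
    "task_c": [False, False, False, False],  # never solved
}

print(f"pass@1 = {pass_at_1(toy_results):.2%}")
print(f"pass^4 = {pass_hat_k(toy_results, 4):.2%}")
```

Under these assumptions pass^k can never exceed pass@1, since a task that fails any run contributes nothing to pass^k. This matches the ordering of the reported figures, where each model's pass^4 is lower than its pass@1.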

Cite

Text

Wu et al. "MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-mcpmark/)

BibTeX

@inproceedings{wu2026iclr-mcpmark,
  title     = {{MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use}},
  author    = {Wu, Zijian and Liu, Xiangyan and Zhang, Xinyuan and Chen, Lingjun and Meng, Fanqing and Du, Lingxiao and Zhao, Yiran and Zhang, Fanshi and Ye, Yaoqi and Wang, Jiawei and Wang, Zirui and Ni, Jinjie and Yang, Yufan and Xu, Arvin and Shieh, Michael Qizhe},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-mcpmark/}
}