CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

Abstract

AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short because they rely on small-scale benchmarks and measure only static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation shows that CyberGym effectively differentiates the cybersecurity capabilities of agents and models. Even the top-performing combinations achieve only a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 34 zero-day vulnerabilities and 18 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.

Cite

Text

Wang et al. "CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-cybergym/)

BibTeX

@inproceedings{wang2026iclr-cybergym,
  title     = {{CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale}},
  author    = {Wang, Zhun and Shi, Tianneng and He, Jingxuan and Cai, Matthew and Zhang, Jialin and Song, Dawn},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-cybergym/}
}