WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions

Srivastava, Sanjari; Li, Gang; Chang, Cheng; Garg, Rishu; Kaur, Manpreet; Lee, Charlene Y.; Li, Yuezhang; Mao, Yining; Cases, Ignacio; Xie, Yanan; Qi, Peng

ICLR 2026

Abstract

Training web agents to navigate complex, real-world websites requires them to master subtasks: short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR on top of SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks. More details about WARC-Bench can be found at https://sanjari-orb.github.io/warc-bench/.

Cite

Text

Srivastava et al. "WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions." International Conference on Learning Representations, 2026.

Markdown

[Srivastava et al. "WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/srivastava2026iclr-warcbench/)

BibTeX

@inproceedings{srivastava2026iclr-warcbench,
  title     = {{WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions}},
  author    = {Srivastava, Sanjari and Li, Gang and Chang, Cheng and Garg, Rishu and Kaur, Manpreet and Lee, Charlene Y. and Li, Yuezhang and Mao, Yining and Cases, Ignacio and Xie, Yanan and Qi, Peng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/srivastava2026iclr-warcbench/}
}