WebCanvas: Benchmarking Web Agents in Online Environments
Abstract
For web agents to be practically useful, they need to generalize to the ever-changing web environment, including UI updates, page content updates, and so on. Unfortunately, most traditional benchmarks only capture a static state of the web page. We introduce WebCanvas, an innovative online evaluation framework for web agents designed to address the dynamic nature of web interactions. WebCanvas contains three main components supporting realistic assessments: (1) a key-node-based evaluation metric, which stably captures the critical actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original static Mind2Web dataset, containing 542 tasks with 2,439 intermediate evaluation states; (3) lightweight and generalizable annotation tools and testing pipelines, which allow us to maintain a high-quality, up-to-date dataset and automatically detect shifts in live action sequences. Despite these advancements, the best-performing model achieves only a 23.1% task success rate, highlighting substantial room for improvement in future work.
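
The abstract's key-node idea can be illustrated with a minimal sketch: score a trajectory by how many required intermediate states it passes through, in order, rather than by exact action matching. This is only an assumption-laden illustration; names such as KeyNode and score_trajectory are hypothetical and not the WebCanvas API.

```python
# Minimal sketch of key-node-based trajectory scoring (hypothetical names, not the
# WebCanvas implementation). A trajectory is a list of observed states (e.g., URLs
# or element values); each key node is a predicate that must hold at some point.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KeyNode:
    """One critical intermediate state a correct trajectory must pass through."""
    description: str
    matches: Callable[[str], bool]  # predicate over an observed state


def score_trajectory(states: List[str], key_nodes: List[KeyNode]) -> float:
    """Return the fraction of key nodes matched, in order, by the trajectory."""
    idx = 0
    matched = 0
    for node in key_nodes:
        # Scan forward so key nodes must be hit in the required order.
        while idx < len(states) and not node.matches(states[idx]):
            idx += 1
        if idx < len(states):
            matched += 1
            idx += 1
    return matched / len(key_nodes) if key_nodes else 1.0


# Example: judge a flight-search task by URL substrings instead of exact pages,
# so cosmetic UI or content changes do not affect the score.
nodes = [
    KeyNode("reached search results", lambda s: "flights/search" in s),
    KeyNode("selected a one-way fare", lambda s: "fare=oneway" in s),
]
trajectory = [
    "example.com/home",
    "example.com/flights/search?q=SFO",
    "example.com/checkout?fare=oneway",
]
print(score_trajectory(trajectory, nodes))  # 1.0 -> all key nodes reached
```
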
Cite
Text
Pan et al. "WebCanvas: Benchmarking Web Agents in Online Environments." ICML 2024 Workshops: Agentic_Markets, 2024.
Markdown
[Pan et al. "WebCanvas: Benchmarking Web Agents in Online Environments." ICML 2024 Workshops: Agentic_Markets, 2024.](https://mlanthology.org/icmlw/2024/pan2024icmlw-webcanvas/)
BibTeX
@inproceedings{pan2024icmlw-webcanvas,
title = {{WebCanvas: Benchmarking Web Agents in Online Environments}},
author = {Pan, Yichen and Kong, Dehan and Zhou, Sida and Cui, Cheng and Leng, Yifei and Jiang, Bing and Liu, Hangyu and Shang, Yanyi and Zhou, Shuyan and Wu, Tongshuang and Wu, Zhengyang},
booktitle = {ICML 2024 Workshops: Agentic_Markets},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/pan2024icmlw-webcanvas/}
}