VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Abstract
Videos are often used to learn or extract the information needed to complete tasks in ways that text or static imagery cannot provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. While skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves a 13.3% success rate on factual retention tasks and 45.8% on factual retention QA pairs—far below human success rates of 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease in WebArena tasks and a 10.3% decrease in VisualWebArena tasks. Our work highlights performance gaps in the agentic abilities of long-context multimodal models and serves as a testbed for the future development of long-context video agents.
Cite
Text
Jang et al. "VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks." International Conference on Learning Representations, 2025.
Markdown
[Jang et al. "VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/jang2025iclr-videowebarena/)
BibTeX
@inproceedings{jang2025iclr-videowebarena,
title = {{VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks}},
author = {Jang, Lawrence Keunho and Li, Yinheng and Zhao, Dan and Ding, Charles and Lin, Justin and Liang, Paul Pu and Bonatti, Rogerio and Koishida, Kazuhito},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/jang2025iclr-videowebarena/}
}