The BrowserGym Ecosystem for Web Agent Research

Abstract

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to address this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, support more reliable comparisons, and facilitate in-depth analysis of agent behaviors. It could result in more adaptable and capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI’s and Anthropic’s latest models, with Claude-3.5-Sonnet leading on almost all benchmarks except vision-related tasks, where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
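To make the idea of a "gym-like environment with well-defined observation and action spaces" concrete, here is a minimal sketch of the agent/environment interaction loop. It assumes the Gymnasium convention (`reset()` returns `(obs, info)`, `step(action)` returns `(obs, reward, terminated, truncated, info)`). `ToyWebEnv`, its observation fields, `naive_agent`, and the `click('b2')` action string are hypothetical toys for illustration only; they are not BrowserGym's actual API, and real BrowserGym tasks drive a live browser.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWebEnv:
    """Hypothetical stand-in for a web task: the goal is to click the "Submit" button.

    Follows the Gymnasium convention: reset() -> (obs, info),
    step(action) -> (obs, reward, terminated, truncated, info).
    """

    max_steps: int = 5
    steps: int = field(default=0, init=False)

    def _obs(self) -> dict:
        # Hypothetical text observation loosely modeled on an accessibility tree,
        # where each element carries a bracketed element id.
        return {
            "goal": "Click the Submit button.",
            "axtree": "[a1] textbox 'Name'\n[b2] button 'Submit'",
        }

    def reset(self):
        self.steps = 0
        return self._obs(), {}

    def step(self, action: str):
        self.steps += 1
        success = action == "click('b2')"  # hypothetical action string format
        reward = 1.0 if success else 0.0
        terminated = success
        truncated = not success and self.steps >= self.max_steps
        return self._obs(), reward, terminated, truncated, {"last_action": action}


def naive_agent(obs: dict) -> str:
    """Hypothetical agent: click the first element whose line mentions 'Submit'."""
    for line in obs["axtree"].splitlines():
        if "Submit" in line:
            element_id = line[1:line.index("]")]
            return f"click('{element_id}')"
    return "noop()"


def run_episode(env, agent) -> float:
    """Run one episode and return the cumulative reward."""
    obs, info = env.reset()
    total = 0.0
    while True:
        action = agent(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        if terminated or truncated:
            return total


if __name__ == "__main__":
    print(run_episode(ToyWebEnv(), naive_agent))  # 1.0
```

Because every task exposes the same `reset`/`step` contract, the same `run_episode` loop can be reused unchanged across tasks, which is the property that makes standardized, cross-benchmark evaluation possible.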

Cite

de Chezelles et al. "The BrowserGym Ecosystem for Web Agent Research." Transactions on Machine Learning Research, 2025.

BibTeX

@article{dechezelles2025tmlr-browsergym,
  title     = {{The BrowserGym Ecosystem for Web Agent Research}},
  author    = {de Chezelles, Thibault Le Sellier and Gasse, Maxime and Lacoste, Alexandre and Caccia, Massimo and Drouin, Alexandre and Boisvert, Léo and Thakkar, Megh and Marty, Tom and Assouel, Rim and Shayegan, Sahar Omidi and Jang, Lawrence Keunho and Lù, Xing Han and Yoran, Ori and Kong, Dehan and Xu, Frank F. and Reddy, Siva and Neubig, Graham and Cappart, Quentin and Salakhutdinov, Russ and Chapados, Nicolas},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/dechezelles2025tmlr-browsergym/}
}