ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Abstract
Multi-modal large language models (MLLMs) are rapidly advancing in visual understanding and reasoning, enhancing GUI agents for tasks such as web browsing and mobile interactions. However, while these agents rely on reasoning skills for action planning, UI grounding (localizing the target element) depends entirely on the model's raw perception capability, and existing grounding models struggle with high-resolution displays, small targets, and complex environments. In this work, we introduce a novel method that improves MLLMs' grounding performance in high-resolution, complex UI environments through a visual search approach based on visual reasoning. Additionally, we create a new benchmark, dubbed ScreenSpot-Pro, designed to comprehensively evaluate model capabilities in professional high-resolution settings. The benchmark consists of real-world high-resolution images and expert-annotated tasks from diverse professional domains. Our experiments show that existing GUI grounding models perform poorly on this dataset, with the best achieving only 18.9% accuracy, whereas our visual-reasoning strategy significantly improves performance, reaching 48.1% without any additional training.
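The abstract does not spell out the search procedure, so the sketch below is only a rough illustration of what a coarse-to-fine visual search loop for GUI grounding could look like, together with the point-in-box accuracy check commonly used by the ScreenSpot family of benchmarks. The function names, the stopping criterion, and the `query_model` interface are illustrative assumptions, not the paper's actual implementation.

```python
"""Illustrative sketch: coarse-to-fine visual search for GUI grounding.

Assumptions (not from the paper): the names `visual_search_ground`,
`query_model`, and the `min_side` stopping criterion are hypothetical.
"""

from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def visual_search_ground(
    screenshot_size: Tuple[int, int],
    query_model: Callable[[Box, str], Box],
    instruction: str,
    min_side: int = 1000,
) -> Tuple[int, int]:
    """Iteratively narrow the search region until it is small enough
    for a grounding model to resolve the target reliably.

    `query_model(region, instruction)` is a hypothetical callable that
    crops the screenshot to `region`, asks the MLLM which sub-region
    (in full-image coordinates) likely contains the target, and
    returns that sub-region.
    """
    width, height = screenshot_size
    region: Box = (0, 0, width, height)
    # Zoom in while the region is still larger than what a typical
    # MLLM can ground accurately at its native input resolution.
    while max(region[2] - region[0], region[3] - region[1]) > min_side:
        region = query_model(region, instruction)
    # Final answer: the center of the last predicted region.
    return (region[0] + region[2]) // 2, (region[1] + region[3]) // 2


def is_correct(pred_point: Tuple[int, int], target_box: Box) -> bool:
    """ScreenSpot-style metric: a prediction counts as correct if the
    predicted click point falls inside the target element's box."""
    x, y = pred_point
    left, top, right, bottom = target_box
    return left <= x <= right and top <= y <= bottom
```

The intuition behind such a loop is that each zoom step hands the model a crop closer to the resolutions it handles well, which is consistent with the abstract's claim that the strategy improves performance without any additional training.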
Cite
Text
Li et al. "ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.

Markdown
[Li et al. "ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/li2025iclrw-screenspotpro/)

BibTeX
@inproceedings{li2025iclrw-screenspotpro,
title = {{ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use}},
author = {Li, Kaixin and Meng, Ziyang and Lin, Hongzhan and Luo, Ziyang and Tian, Yuchen and Ma, Jing and Huang, Zhiyong and Chua, Tat-Seng},
booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/li2025iclrw-screenspotpro/}
}