UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction as Reasoning

Chen, Liangyu; Zhou, Hanzhang; Cai, Chenglin; Zhang, Jianan; Tong, Panrong; Zhang, Xu; Kong, Quyu; Liu, Chen; Liu, Yuqi; Wang, Wenxuan; Wang, Yue; Jin, Qin; Hoi, Steven

ICLR 2026

Abstract

GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and models are released.

Cite

Text

Chen et al. "UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction as Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Chen et al. "UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction as Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chen2026iclr-uiins/)

BibTeX

@inproceedings{chen2026iclr-uiins,
  title     = {{UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction as Reasoning}},
  author    = {Chen, Liangyu and Zhou, Hanzhang and Cai, Chenglin and Zhang, Jianan and Tong, Panrong and Zhang, Xu and Kong, Quyu and Liu, Chen and Liu, Yuqi and Wang, Wenxuan and Wang, Yue and Jin, Qin and Hoi, Steven},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/chen2026iclr-uiins/}
}