LongHorizonUI: A Unified Framework for Robust Long-Horizon Task Automation of GUI Agent

Abstract

Although agents based on multimodal large language models (MLLMs) are proficient at general short-horizon graphical user interface (GUI) tasks, their robustness remains a significant challenge on complex long-horizon tasks in dynamic environments. In response, we propose the LongHorizonUI framework to improve the sustained reliability of agents in long-horizon GUI tasks. We first establish a comprehensive long-horizon benchmark, LongGUIBench, covering multiple categories of games and complex general applications, with long-horizon tasks defined as those requiring more than 15 steps, for rigorous evaluation of long-horizon reasoning capabilities. Building on this benchmark, we design a Multimodal Enhanced Perceiver that incorporates element detection and text recognition models and assigns unique indices to interface elements, thereby reinforcing state representation. Furthermore, we introduce a Deep Reflection Decider engine, which incorporates a structured multi-level feedback validation mechanism to enable progressive reasoning and ensure accurate action execution with predictable trajectories. Finally, we introduce a Compensatory Action Executor that combines multiple degradation compensation operations with a process rollback strategy based on execution progress monitoring to ensure operational effectiveness in long-horizon task logic. Experimental results demonstrate that LongHorizonUI achieves substantial long-horizon modeling improvements on LongGUIBench while retaining competitive performance on diverse public benchmarks. The code and models will be publicly available.
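The abstract gives no implementation details, so the following is only a minimal sketch of two ideas it names: assigning unique indices to detected interface elements, and trying fallback operations with a rollback after each failed attempt. Every name here (`UIElement`, `index_elements`, `execute_with_compensation`), the reading-order indexing rule, and the boolean success signal are hypothetical illustrations, not the authors' method or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class UIElement:
    index: int  # unique index that a decider could refer to
    box: Box
    text: str   # recognized text; empty for icon-only elements


def index_elements(detections: Sequence[Tuple[Box, str]]) -> List[UIElement]:
    """Assign unique indices in reading order (top-to-bottom, then left-to-right).

    The ordering rule is an assumption made for this sketch.
    """
    ordered = sorted(detections, key=lambda d: (d[0][1], d[0][0]))
    return [UIElement(i, box, text) for i, (box, text) in enumerate(ordered)]


def execute_with_compensation(
    strategies: Sequence[Callable[[], bool]],
    rollback: Callable[[], None],
) -> Optional[int]:
    """Try each strategy in order and roll back after every failed attempt.

    Returns the position of the first strategy that reports success,
    or None if all of them fail.
    """
    for position, attempt in enumerate(strategies):
        if attempt():
            return position
        rollback()  # restore the previous state before the next fallback
    return None
```

In this sketch the caller would pass the preferred operation first (for example, tapping an element by its index) followed by progressively coarser fallbacks; how the actual framework orders its compensation operations or monitors progress is not stated in the abstract.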

Cite

Text

Kang et al. "LongHorizonUI: A Unified Framework for Robust Long-Horizon Task Automation of GUI Agent." International Conference on Learning Representations, 2026.

Markdown

[Kang et al. "LongHorizonUI: A Unified Framework for Robust Long-Horizon Task Automation of GUI Agent." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kang2026iclr-longhorizonui/)

BibTeX

@inproceedings{kang2026iclr-longhorizonui,
  title     = {{LongHorizonUI: A Unified Framework for Robust Long-Horizon Task Automation of GUI Agent}},
  author    = {Kang, Bin and Wen, Shaoguo and Bi, Yifei and Wu, Shunlong and Yuan, Xinbin and Shao, Rui and Wang, Junle and Tian, Zhuotao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kang2026iclr-longhorizonui/}
}