Harnessing Webpage UIs for Text-Rich Visual Understanding
Abstract
Text-rich visual understanding—the ability to interpret both textual content and visual elements within a scene—is crucial for multimodal large language models (MLLMs) to effectively interact with structured environments. We propose leveraging webpage UIs as a naturally structured and diverse data source to enhance MLLMs’ capabilities in this area. Existing approaches, such as rule-based extraction, multimodal model captioning, and rigid HTML parsing, are hindered by issues like noise, hallucinations, and limited generalization. To overcome these challenges, we introduce MultiUI, a dataset of 7.3 million samples spanning various UI types and tasks, structured using enhanced accessibility trees and task taxonomies. By scaling multimodal instructions from web UIs through LLMs, our dataset enhances generalization beyond web domains, significantly improving performance in document understanding, GUI comprehension, grounding, and advanced agent tasks. This demonstrates the potential of structured web data to elevate MLLMs’ proficiency in processing text-rich visual environments and generalizing across domains.
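The abstract notes that MultiUI structures webpage data using enhanced accessibility trees. As a rough illustration of what a raw accessibility tree looks like (a minimal sketch using Playwright's accessibility snapshot API, not the authors' pipeline, which further enriches these trees before instruction synthesis):

# Minimal sketch: dump a webpage's accessibility tree with Playwright.
# Illustrative only; MultiUI's "enhanced" trees add further structure.
from playwright.sync_api import sync_playwright

def walk(node, depth=0):
    """Print each accessibility node as 'role: name', indented by depth."""
    if node is None:
        return
    print("  " * depth + f"{node.get('role', '')}: {node.get('name', '')}")
    for child in node.get("children", []):
        walk(child, depth + 1)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    tree = page.accessibility.snapshot()  # nested dict of role/name/children
    walk(tree)
    browser.close()

Each node in the snapshot carries a role (e.g., link, button, heading) and an accessible name, which is what makes the tree a cleaner signal than raw HTML for building UI instructions.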
Cite
Text
Liu et al. "Harnessing Webpage UIs for Text-Rich Visual Understanding." International Conference on Learning Representations, 2025.

Markdown
[Liu et al. "Harnessing Webpage UIs for Text-Rich Visual Understanding." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/liu2025iclr-harnessing/)

BibTeX
@inproceedings{liu2025iclr-harnessing,
title = {{Harnessing Webpage UIs for Text-Rich Visual Understanding}},
author = {Liu, Junpeng and Ou, Tianyue and Song, Yifan and Qu, Yuxiao and Lam, Wai and Xiong, Chenyan and Chen, Wenhu and Neubig, Graham and Yue, Xiang},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/liu2025iclr-harnessing/}
}