Web-CogReasoner: Towards Multimodal Knowledge-Induced Cognitive Reasoning for Web Agents

Abstract

Multimodal large-scale models have significantly advanced the development of web agents, enabling them to perceive and interact with the digital environment in a manner analogous to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to engage in cognitive reasoning effectively. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge into Factual, Conceptual, and Procedural domains. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on Factual and Conceptual knowledge, respectively, and represent the "what" of learning. Conversely, cognitive processes correspond to Exploring, which is grounded in Procedural knowledge and defines the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites and designed to systematically instill the core knowledge a web agent needs. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, with which we develop and train our proposed multimodal web agent, Web-CogReasoner. Extensive experiments show that it significantly outperforms existing models, particularly in generalizing to unseen tasks, where its structured knowledge proves decisive. To facilitate rigorous and systematic evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities.
Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner.
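
As a reading aid, the taxonomy described in the abstract can be written down as a small lookup structure. This is only an illustrative sketch of the stated mapping (Factual to Memorizing, Conceptual to Understanding, Procedural to Exploring, and the two stages they belong to). The names `Knowledge`, `Process`, `Stage`, and `describe` are hypothetical and are not taken from the paper's code release.

```python
from enum import Enum


class Knowledge(Enum):
    """Knowledge domains named in the abstract."""
    FACTUAL = "Factual"
    CONCEPTUAL = "Conceptual"
    PROCEDURAL = "Procedural"


class Process(Enum):
    """Agent processes named in the abstract."""
    MEMORIZING = "Memorizing"
    UNDERSTANDING = "Understanding"
    EXPLORING = "Exploring"


class Stage(Enum):
    """The two capability stages named in the abstract."""
    KNOWLEDGE_CONTENT_LEARNING = "knowledge content learning"
    COGNITIVE_PROCESSES = "cognitive processes"


# Which process each knowledge domain supports, per the abstract.
PROCESS_FOR = {
    Knowledge.FACTUAL: Process.MEMORIZING,
    Knowledge.CONCEPTUAL: Process.UNDERSTANDING,
    Knowledge.PROCEDURAL: Process.EXPLORING,
}

# Which stage each process belongs to, per the abstract.
STAGE_FOR = {
    Process.MEMORIZING: Stage.KNOWLEDGE_CONTENT_LEARNING,
    Process.UNDERSTANDING: Stage.KNOWLEDGE_CONTENT_LEARNING,
    Process.EXPLORING: Stage.COGNITIVE_PROCESSES,
}


def describe(knowledge: Knowledge) -> tuple[Process, Stage]:
    """Return the (process, stage) pair associated with a knowledge domain."""
    process = PROCESS_FOR[knowledge]
    return process, STAGE_FOR[process]


if __name__ == "__main__":
    for k in Knowledge:
        process, stage = describe(k)
        print(f"{k.value} -> {process.value} ({stage.value})")
```

The sketch encodes only the relationships stated in the abstract; how the dataset, the CoT framework, or the benchmark organize these categories in practice is not specified here.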

Cite

Text

Guo et al. "Web-CogReasoner: Towards Multimodal Knowledge-Induced Cognitive Reasoning for Web Agents." International Conference on Learning Representations, 2026.

Markdown

[Guo et al. "Web-CogReasoner: Towards Multimodal Knowledge-Induced Cognitive Reasoning for Web Agents." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/guo2026iclr-webcogreasoner/)

BibTeX

@inproceedings{guo2026iclr-webcogreasoner,
  title     = {{Web-CogReasoner: Towards Multimodal Knowledge-Induced Cognitive Reasoning for Web Agents}},
  author    = {Guo, Yuhan and Guocong,  and Sun, Aiwen and He, Hongliang and Yang, Xinyu and Lu, Yue and Zhang, Yingji and Guo, Xuntao and Zhang, Dong and Liu, Jianzhuang and Duan, Jiang and Xiao, Yijia and Wen, Liangjian and Xu, Haiming and Dai, Yong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/guo2026iclr-webcogreasoner/}
}