Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
Abstract
Internal world models (WMs) enable agents to understand the world's state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as GPT-4o and Gemini, show potential as general-purpose WMs. While recent studies have evaluated and identified limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs' fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses *Perception* (visual, spatial, temporal, quantitative, and motion) and *Prediction* (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce **WM-ABench**, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 517 controlled experiments on 11 of the latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world-modeling abilities. For instance, all models perform at near-random accuracy when distinguishing motion trajectories. They also lack disentangled understanding; for example, they tend to believe that blue objects move faster than green ones. Further results and analyses reveal significant gaps between VLMs and human-level world modeling.
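As a purely illustrative aid, the sketch below shows one way an atomic, per-dimension evaluation of this kind could be structured in Python. The paper's actual data format and harness are not given on this page, so the item fields, the multiple-choice setup, and the `predict` interface are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: every name and the multiple-choice format below
# are assumptions; the paper's real data schema and harness may differ.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkItem:
    """One atomic probe: rendered frames, a question, and candidate answers."""
    dimension: str        # e.g. "motion" (Perception) or "transitive" (Prediction)
    frames: list          # image frames from a controlled simulated environment
    question: str         # natural-language query about a state or transition
    choices: list[str]    # candidate answers, exactly one correct
    answer_idx: int       # index of the ground-truth choice

def accuracy(items: list[BenchmarkItem],
             predict: Callable[[list, str, list[str]], int]) -> float:
    """Score a model (a callable returning a choice index) on one dimension."""
    correct = sum(predict(it.frames, it.question, it.choices) == it.answer_idx
                  for it in items)
    return correct / len(items)

# Chance level depends on the number of choices (e.g. 0.25 for 4-way items);
# "near-random accuracy" means a model scores close to that baseline.
```

Scoring each dimension in isolation like this is what makes the evaluation "atomic": a failure on, say, motion-trajectory items can be attributed to that capability rather than to entangled skills.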
Cite
Text
Gao et al. "Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation." ICLR 2025 Workshops: World_Models, 2025.

Markdown
[Gao et al. "Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation." ICLR 2025 Workshops: World_Models, 2025.](https://mlanthology.org/iclrw/2025/gao2025iclrw-visionlanguage/)

BibTeX
@inproceedings{gao2025iclrw-visionlanguage,
  title = {{Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation}},
  author = {Gao, Qiyue and Pi, Xinyu and Liu, Kevin and Chen, Junrong and Yang, Ruolan and Huang, Xinqi and Fang, Xinyu and Sun, Lu and Kishore, Gautham and Ai, Bo and Tao, Stone and Liu, Mengyang and Yang, Jiaxi and Lai, Chao-Jung and Jin, Chuanyang and Xiang, Jiannan and Huang, Benhao and Danks, David and Su, Hao and Shu, Tianmin and Ma, Ziqiao and Qin, Lianhui and Hu, Zhiting},
  booktitle = {ICLR 2025 Workshops: World_Models},
  year = {2025},
  url = {https://mlanthology.org/iclrw/2025/gao2025iclrw-visionlanguage/}
}