Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

Abstract

Spatial reasoning ability is crucial for Vision-Language Models (VLMs) to support real-world applications in diverse domains, including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate for assessing spatial reasoning ability, especially intrinsic-dynamic spatial reasoning, which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address data scarcity, we develop a scalable, automated pipeline that generates diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation of 32 state-of-the-art VLMs reveals that current VLMs show a large and consistent gap relative to human competence, especially on multi-step, multi-view spatial reasoning. Spatial-DISE offers a robust framework, a valuable dataset, and a clear direction for future research toward human-like spatial intelligence. The benchmark, dataset, and code are available at https://shinmohuang.github.io/spatialdise_page/Spatial-DISE.

Cite

Text

Huang et al. "Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-spatialdise/)

BibTeX

@inproceedings{huang2026iclr-spatialdise,
  title     = {{Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models}},
  author    = {Huang, Xinmiao and He, Qisong and Huang, Zhenglin and Wang, Boxuan and Li, Zhuoyun and Cheng, Guangliang and Dong, Yi and Huang, Xiaowei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-spatialdise/}
}