InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

Abstract

Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain constrained by limited scale, narrow visual diversity, and restricted instruction expressiveness. To address these gaps, we present InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, alongside InternSpatial-Bench, a comprehensive evaluation benchmark designed to assess spatial understanding across diverse instruction formats. InternSpatial contains 12 million question-answer (QA) pairs covering both single-view and multi-view scenarios, sourced from varied visual environments and supporting 19 distinct instruction formats that mirror real-world query patterns. InternSpatial-Bench targets single-view assessment and extends to multi-view reasoning through a novel rotation estimation task. Experimental validation demonstrates that models trained on InternSpatial achieve substantial performance improvements of 12.1% on InternSpatial-Bench and 10.7% on VSI-Bench, while preserving competitive performance on general-purpose benchmarks. We expect these resources to advance the development of spatially capable VLMs for practical applications in robotics and embodied AI systems. Our code and datasets are publicly available at https://github.com/dengnianchen/intern-spatial.

Cite

Text

Deng et al. "InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models." International Conference on Learning Representations, 2026.

Markdown

[Deng et al. "InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/deng2026iclr-internspatial/)

BibTeX

@inproceedings{deng2026iclr-internspatial,
  title     = {{InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models}},
  author    = {Deng, Nianchen and Gu, Lixin and Ye, Shenglong and He, Yinan and Chen, Zhe and Li, Songze and Wang, Haomin and Yin, Jinhui and Wei, Qi and Yang, Tianshuo and Dou, Min and He, Tong and Shao, Wenqi and Zhang, Kaipeng and Wang, Yi and Shi, Botian and Zhang, Yanting and Dai, Jifeng and Qiao, Yu and Wang, Wenhai and Zhang, Hongjie},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/deng2026iclr-internspatial/}
}