SITE: Towards Spatial Intelligence Thorough Evaluation
Abstract
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.
Cite
Text
Wang et al. "SITE: Towards Spatial Intelligence Thorough Evaluation." International Conference on Computer Vision, 2025.
Markdown
[Wang et al. "SITE: Towards Spatial Intelligence Thorough Evaluation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/wang2025iccv-site/)
BibTeX
@inproceedings{wang2025iccv-site,
title = {{SITE: Towards Spatial Intelligence Thorough Evaluation}},
author = {Wang, Wenqi and Tan, Reuben and Zhu, Pengyue and Yang, Jianwei and Yang, Zhengyuan and Wang, Lijuan and Kolobov, Andrey and Gao, Jianfeng and Gong, Boqing},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {9058--9069},
url = {https://mlanthology.org/iccv/2025/wang2025iccv-site/}
}