Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Abstract

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating Chain-of-Thought (CoT)-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.

Cite

Text

Feng et al. "Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes." International Conference on Learning Representations, 2026.

Markdown

[Feng et al. "Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/feng2026iclr-seeing/)

BibTeX

@inproceedings{feng2026iclr-seeing,
  title     = {{Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes}},
  author    = {Feng, ZhiYuan and Kang, Zhaolu and Wang, Qijie and Du, Zhiying and Yan, Jiongrui and Shubin, Shi and Yuan, Chengbo and Liang, Huizhi and Deng, Yu and Li, Qixiu and Yang, Rushuai and An, Ruichuan and Zheng, Leqi and Wang, Weijie and Chen, Shuang and Xu, Sicheng and Liang, Yaobo and Yang, Jiaolong and Guo, Baining},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/feng2026iclr-seeing/}
}