SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Abstract

Understanding and reasoning about spatial relationships is crucial for Visual Question Answering (VQA) and robotics. Vision Language Models (VLMs) have shown impressive performance on some VQA benchmarks but struggle with 3D spatial reasoning, such as recognizing distances or size differences between physical objects. This limitation may stem from a lack of 3D spatial knowledge in their training data. To address this, we propose training VLMs with extensive spatial reasoning data from the internet. Our approach includes developing an automatic 3D spatial VQA data generation framework capable of creating 2 billion VQA examples from 10 million real-world images. We explore various factors in the training process, such as data quality, training pipeline, and VLM architecture. Our work introduces the first Internet-scale 3D spatial reasoning dataset in metric space. By co-training a VLM with this dataset, we significantly improve its performance on both qualitative and quantitative spatial VQA. Additionally, this enhanced VLM enables new applications in chain-of-thought spatial reasoning and robotics, particularly in quantitative estimation.

Cite

Text

Chen et al. "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01370

Markdown

[Chen et al. "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/chen2024cvpr-spatialvlm/) doi:10.1109/CVPR52733.2024.01370

BibTeX

@inproceedings{chen2024cvpr-spatialvlm,
  title     = {{SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities}},
  author    = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14455--14465},
  doi       = {10.1109/CVPR52733.2024.01370},
  url       = {https://mlanthology.org/cvpr/2024/chen2024cvpr-spatialvlm/}
}