STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Abstract

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding, their ability to perform precise and quantitative spatial-temporal understanding in real-world settings remains largely unexamined, leaving their suitability for such applications uncertain. To address this gap, we introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

Cite

Text

Li et al. "STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?" International Conference on Computer Vision, 2025.

Markdown

[Li et al. "STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?" International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/li2025iccv-stibench/)

BibTeX

@inproceedings{li2025iccv-stibench,
  title     = {{STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?}},
  author    = {Li, Yun and Zhang, Yiming and Lin, Tao and Liu, Xiangrui and Cai, Wenxiao and Liu, Zheng and Zhao, Bo},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {5622--5632},
  url       = {https://mlanthology.org/iccv/2025/li2025iccv-stibench/}
}