4D-Bench: Benchmarking Multi-Modal Large Language Models for 4D Object Understanding

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly available standardized benchmarks for assessing how well MLLMs understand 4D objects. In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. Unlike existing 2D image/video-based benchmarks, 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks that require multi-view spatial-temporal understanding. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results of the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.

Cite

Text

Zhu et al. "4D-Bench: Benchmarking Multi-Modal Large Language Models for 4D Object Understanding." International Conference on Computer Vision, 2025.

Markdown

[Zhu et al. "4D-Bench: Benchmarking Multi-Modal Large Language Models for 4D Object Understanding." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhu2025iccv-4dbench/)

BibTeX

@inproceedings{zhu2025iccv-4dbench,
  title     = {{4D-Bench: Benchmarking Multi-Modal Large Language Models for 4D Object Understanding}},
  author    = {Zhu, Wenxuan and Li, Bing and Zheng, Cheng and Mai, Jinjie and Chen, Jun and Jiang, Letian and Hamdi, Abdullah and Martinez, Sara Rojas and Lin, Chia-Wen and Elhoseiny, Mohamed and Ghanem, Bernard},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {21129--21143},
  url       = {https://mlanthology.org/iccv/2025/zhu2025iccv-4dbench/}
}