Embodied Scene Understanding for Vision Language Models via MetaVQA

Wang, Weizhen; Duan, Chenda; Peng, Zhenghao; Liu, Yuxin; Zhou, Bolei

doi:10.1109/CVPR52734.2025.02091

CVPR 2025 pp. 22453-22464

Abstract

Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from the nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA Dataset significantly improves their embodied scene understanding, which is evident not only in improved VQA accuracy but also in emerging safety-aware driving maneuvers. In addition, the learning exhibits strong transferability from simulation to real-world observation. The project webpage is at https://metadriverse.github.io/metavqa.

Cite

Text

Wang et al. "Embodied Scene Understanding for Vision Language Models via MetaVQA." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02091

Markdown

[Wang et al. "Embodied Scene Understanding for Vision Language Models via MetaVQA." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-embodied/) doi:10.1109/CVPR52734.2025.02091

BibTeX

@inproceedings{wang2025cvpr-embodied,
  title     = {{Embodied Scene Understanding for Vision Language Models via MetaVQA}},
  author    = {Wang, Weizhen and Duan, Chenda and Peng, Zhenghao and Liu, Yuxin and Zhou, Bolei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {22453--22464},
  doi       = {10.1109/CVPR52734.2025.02091},
  url       = {https://mlanthology.org/cvpr/2025/wang2025cvpr-embodied/}
}