3D Question Answering via Only 2D Vision-Language Models
Abstract
Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Because 3D training data is limited, we do not train LVLMs but instead run inference in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. Once the 2D model is fixed, e.g., LLAVA-OV, the quality of the sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks.
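The selection step described in the abstract (rank views by how critical they are, then drop views that overlap spatially with ones already kept) can be illustrated with a greedy non-maximum-suppression loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `view_nms`, the per-view `scores` (standing in for viewSelector outputs), the pairwise `overlap` matrix with values in [0, 1], and the `threshold` and `max_views` parameters are all hypothetical, since the abstract does not specify how overlap is measured or how many views are kept.

```python
from typing import List, Sequence


def view_nms(
    scores: Sequence[float],
    overlap: Sequence[Sequence[float]],
    threshold: float = 0.5,   # hypothetical overlap cutoff
    max_views: int = 9,       # hypothetical view budget
) -> List[int]:
    """Greedy NMS over views (illustrative assumption, not the paper's code).

    scores[i]     : criticality score of view i (assumed to come from a selector).
    overlap[i][j] : spatial overlap between views i and j, in [0, 1].
    Returns indices of kept views, highest score first.
    """
    # Visit views from most to least critical.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) >= max_views:
            break
        # Keep view i only if it is not redundant with any view already kept.
        if all(overlap[i][j] <= threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy example: views 0 and 1 see nearly the same region.
    scores = [0.9, 0.8, 0.3, 0.7]
    overlap = [
        [1.0, 0.8, 0.0, 0.1],
        [0.8, 1.0, 0.0, 0.2],
        [0.0, 0.0, 1.0, 0.0],
        [0.1, 0.2, 0.0, 1.0],
    ]
    print(view_nms(scores, overlap))
```

In the toy example, view 1 is dropped because it overlaps heavily with the higher-scoring view 0, while views 3 and 2 are kept because they cover different regions. The kept views would then be passed, together with the question, to the 2D LVLM.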
Cite
Text
Wang et al. "3D Question Answering via Only 2D Vision-Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Wang et al. "3D Question Answering via Only 2D Vision-Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wang2025icml-3d/)
BibTeX
@inproceedings{wang2025icml-3d,
title = {{3D Question Answering via Only 2D Vision-Language Models}},
author = {Wang, Fengyun and Yu, Sicheng and Wu, Jiawei and Tang, Jinhui and Zhang, Hanwang and Sun, Qianru},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {65310-65325},
volume = {267},
url = {https://mlanthology.org/icml/2025/wang2025icml-3d/}
}