PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Abstract
Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1% performance improvement.
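The abstract describes two technical components: rasterizing a robot's reachable workspace into a camera-aligned S-P Map, and feeding that map through an extra encoder alongside the image tokens. Below is a minimal sketch of how such a pipeline could look; it is not the authors' implementation, and the names (`sp_map_from_depth`, `SPMapEncoder`), the spherical workspace approximation, and the token-fusion strategy are illustrative assumptions.

```python
# Hedged sketch: approximate S-P Map construction and a small encoder for it.
# The real PhysVLM pipeline may differ substantially.
import numpy as np
import torch
import torch.nn as nn

def sp_map_from_depth(depth, K, base_T_cam, max_reach, min_reach=0.0):
    """Mark each pixel whose back-projected 3D point lies within the arm's
    reachable radius around the robot base (crude spherical workspace model,
    an assumption for illustration only)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))          # pixel grids, shape (h, w)
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                          # back-project with intrinsics K
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)  # homogeneous points, (h, w, 4)
    pts_base = pts_cam @ base_T_cam.T                        # camera frame -> robot base frame
    dist = np.linalg.norm(pts_base[..., :3], axis=-1)
    return ((dist >= min_reach) & (dist <= max_reach)).astype(np.float32)

class SPMapEncoder(nn.Module):
    """Tiny patch-embedding encoder turning the S-P Map into tokens that could
    be concatenated with image tokens before the language model."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, sp_map):                                # (B, 1, H, W) in [0, 1]
        tokens = self.proj(sp_map).flatten(2).transpose(1, 2) # (B, N_patches, dim)
        return tokens
```

Because the S-P Map is a robot-agnostic spatial mask rather than a set of robot-specific parameters, the same encoder can in principle be attached to different VLM backbones, which is consistent with the abstract's claim of compatibility with models such as GPT-4o-mini.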
Cite
Text
Zhou et al. "PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00651
Markdown
[Zhou et al. "PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhou2025cvpr-physvlm/) doi:10.1109/CVPR52734.2025.00651
BibTeX
@inproceedings{zhou2025cvpr-physvlm,
title = {{PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability}},
author = {Zhou, Weijie and Tao, Manli and Zhao, Chaoyang and Guo, Haiyun and Dong, Honghui and Tang, Ming and Wang, Jinqiao},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {6940--6949},
doi = {10.1109/CVPR52734.2025.00651},
url = {https://mlanthology.org/cvpr/2025/zhou2025cvpr-physvlm/}
}