A3VLM: Actionable Articulation-Aware Vision Language Model

Abstract

Vision Language Models (VLMs) for robotics have received significant attention in recent years. Because a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting large amounts of robot interaction data, which is extremely costly in the real world. We therefore propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.
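
To make the abstract's last claim concrete, the sketch below (not the authors' code; every name, field, and default value is an illustrative assumption) shows how an object-centric, articulation-aware representation, roughly a part's 3D bounding box plus a joint axis and joint type, could be translated into end-effector waypoints with two simple primitives: sliding along the axis for prismatic joints (drawers) and rotating about it for revolute joints (doors).

# Minimal sketch, assuming a box/axis/joint-type representation; not the A3VLM implementation.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class ArticulatedPart:
    """Hypothetical robot-agnostic description of one articulated part."""
    bbox_corners: np.ndarray    # (8, 3) corners of the part's 3D bounding box
    axis_origin: np.ndarray     # (3,) a point on the articulation axis
    axis_direction: np.ndarray  # (3,) unit direction of the articulation axis
    joint_type: str             # "revolute" (e.g., door) or "prismatic" (e.g., drawer)


def to_action_primitive(part: ArticulatedPart, grasp_point: np.ndarray,
                        magnitude: float = 0.1, steps: int = 10) -> List[np.ndarray]:
    """Map the part description to a short end-effector waypoint trajectory
    using two simple primitives: slide along the axis, or rotate about it."""
    waypoints = []
    if part.joint_type == "prismatic":
        # Slide: translate the grasp point along the joint axis.
        for i in range(1, steps + 1):
            waypoints.append(grasp_point + part.axis_direction * magnitude * i / steps)
    elif part.joint_type == "revolute":
        # Rotate: revolve the grasp point about the joint axis by `magnitude` radians,
        # using Rodrigues' rotation formula.
        r = grasp_point - part.axis_origin
        k = part.axis_direction
        for i in range(1, steps + 1):
            theta = magnitude * i / steps
            r_rot = (r * np.cos(theta)
                     + np.cross(k, r) * np.sin(theta)
                     + k * np.dot(k, r) * (1.0 - np.cos(theta)))
            waypoints.append(part.axis_origin + r_rot)
    else:
        raise ValueError(f"Unknown joint type: {part.joint_type}")
    return waypoints

Because such a representation carries no robot-specific information, the same predicted box and axis could in principle be executed by different arms by swapping only the waypoint follower, which is the decoupling the abstract describes.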

Cite

Text

Huang et al. "A3VLM: Actionable Articulation-Aware Vision Language Model." Proceedings of The 8th Conference on Robot Learning, 2024.

Markdown

[Huang et al. "A3VLM: Actionable Articulation-Aware Vision Language Model." Proceedings of The 8th Conference on Robot Learning, 2024.](https://mlanthology.org/corl/2024/huang2024corl-a3vlm/)

BibTeX

@inproceedings{huang2024corl-a3vlm,
  title     = {{A3VLM: Actionable Articulation-Aware Vision Language Model}},
  author    = {Huang, Siyuan and Chang, Haonan and Liu, Yuhan and Zhu, Yimeng and Dong, Hao and Boularias, Abdeslam and Gao, Peng and Li, Hongsheng},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  year      = {2024},
  pages     = {1675--1690},
  volume    = {270},
  url       = {https://mlanthology.org/corl/2024/huang2024corl-a3vlm/}
}