A3VLM: Actionable Articulation-Aware Vision Language Model
Abstract
Vision Language Models (VLMs) for robotics have received significant attention in recent years. As a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. Thus, we propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.
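The abstract states that A3VLM's object-centric representation is robot-agnostic and can be translated into robot actions using simple action primitives. Below is a minimal sketch of what such a translation could look like, assuming a hypothetical annotation of (3D bounding box, articulation axis, joint type); the field names, joint types, and primitive format are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: an articulation-aware, robot-agnostic annotation and its
# mapping to a simple action primitive. All names and fields are assumptions
# made for illustration, not the paper's actual data format.
from dataclasses import dataclass
import numpy as np


@dataclass
class ArticulationAnnotation:
    """Hypothetical articulation-aware description of one object part."""
    bbox: np.ndarray   # (8, 3) corners of the part's 3D bounding box
    axis: np.ndarray   # (2, 3) two points defining the articulation axis
    joint_type: str    # "revolute" (e.g., door hinge) or "prismatic" (e.g., drawer)


def to_action_primitive(ann: ArticulationAnnotation, magnitude: float) -> dict:
    """Translate the robot-agnostic annotation into a simple primitive.

    Any downstream controller could consume this dictionary: grasp near the
    part center, then rotate about the axis or slide along its direction.
    """
    center = ann.bbox.mean(axis=0)
    direction = ann.axis[1] - ann.axis[0]
    direction = direction / np.linalg.norm(direction)

    if ann.joint_type == "revolute":
        return {"primitive": "rotate", "grasp": center,
                "axis_point": ann.axis[0], "axis_dir": direction,
                "angle_rad": magnitude}
    # prismatic joint: slide along the axis direction
    return {"primitive": "slide", "grasp": center,
            "direction": direction, "distance_m": magnitude}


# Example: open a hypothetical drawer by 0.2 m.
drawer = ArticulationAnnotation(
    bbox=np.random.rand(8, 3),
    axis=np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    joint_type="prismatic")
print(to_action_primitive(drawer, 0.2))
```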
Cite
Text

Huang et al. "A3VLM: Actionable Articulation-Aware Vision Language Model." Proceedings of The 8th Conference on Robot Learning, 2024.

Markdown

[Huang et al. "A3VLM: Actionable Articulation-Aware Vision Language Model." Proceedings of The 8th Conference on Robot Learning, 2024.](https://mlanthology.org/corl/2024/huang2024corl-a3vlm/)

BibTeX
@inproceedings{huang2024corl-a3vlm,
  title = {{A3VLM: Actionable Articulation-Aware Vision Language Model}},
  author = {Huang, Siyuan and Chang, Haonan and Liu, Yuhan and Zhu, Yimeng and Dong, Hao and Boularias, Abdeslam and Gao, Peng and Li, Hongsheng},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  year = {2024},
  pages = {1675--1690},
  volume = {270},
  url = {https://mlanthology.org/corl/2024/huang2024corl-a3vlm/}
}