AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

Abstract

Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the scarcity of robot data, existing approaches have employed vision-language pre-training with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to help the robotic arm imitate the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and in real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity. Code available at https://github.com/idejie/ar.
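To make the retrieve-then-map idea in the abstract concrete, here is a minimal PyTorch sketch of one plausible reading of it: retrieve human demonstrations whose task/observation embeddings are closest to the current episode, then regress robot actions from their hand keypoints. This is an illustrative assumption, not the authors' released implementation; the names (`AnalogicalReasoningMap`, `retrieve_similar_videos`), the 21-keypoint hand layout, the 7-dimensional action space, and the cosine-similarity retrieval are all hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the retrieval + analogical-reasoning step described in
# the abstract. All module names and dimensions are illustrative assumptions.

class AnalogicalReasoningMap(nn.Module):
    """Maps human hand keypoints from retrieved videos to robot action targets."""

    def __init__(self, num_keypoints: int = 21, robot_dim: int = 7, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_keypoints * 3, hidden),  # flattened 3D hand keypoints
            nn.ReLU(),
            nn.Linear(hidden, robot_dim),          # e.g. 6-DoF pose + gripper
        )

    def forward(self, hand_keypoints: torch.Tensor) -> torch.Tensor:
        # hand_keypoints: (batch, num_keypoints, 3)
        return self.mlp(hand_keypoints.flatten(1))


def retrieve_similar_videos(query_emb: torch.Tensor,
                            bank_embs: torch.Tensor,
                            k: int = 5) -> torch.Tensor:
    """Return indices of the k human videos whose task/observation embeddings
    are most cosine-similar to the current robot episode."""
    query = F.normalize(query_emb, dim=-1)
    bank = F.normalize(bank_embs, dim=-1)
    scores = bank @ query  # (num_videos,)
    return scores.topk(k).indices


# Usage: retrieve top-k human demonstrations, then map their hand keypoints
# to robot actions with the (to-be-learned) AR map.
bank = torch.randn(1000, 512)              # embeddings of human action videos
query = torch.randn(512)                   # embedding of current task + history
idx = retrieve_similar_videos(query, bank)
keypoints = torch.randn(len(idx), 21, 3)   # hand keypoints of retrieved videos
actions = AnalogicalReasoningMap()(keypoints)
print(actions.shape)  # torch.Size([5, 7])
```

In this reading, retrieval supplies analogous human motions while the learned map resolves the human-to-robot embodiment gap; the paper itself should be consulted for the actual architecture and training objectives.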

Cite

Text

Yang et al. "AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning." International Conference on Computer Vision, 2025.

Markdown

[Yang et al. "AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/yang2025iccv-arvrm/)

BibTeX

@inproceedings{yang2025iccv-arvrm,
  title     = {{AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning}},
  author    = {Yang, Dejie and Zhao, Zijing and Liu, Yang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {6818--6827},
  url       = {https://mlanthology.org/iccv/2025/yang2025iccv-arvrm/}
}