Mitigating the Human-Robot Domain Discrepancy in Visual Pre-Training for Robotic Manipulation

Abstract

Learning generalizable visual representations across different embodied environments is essential for effective robotic manipulation in real-world scenarios. However, the limited scale and diversity of robot demonstration data pose a significant challenge. Recent research has explored leveraging large-scale human activity data for pre-training, but the substantial morphological differences between humans and robots introduce a significant human-robot domain discrepancy, hindering the generalization of these models to downstream manipulation tasks. To overcome this, we propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap. Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner. Experiments on 20 simulated tasks across two different benchmarks and five real-world tasks demonstrate significant improvements. These results span both single-task and language-conditioned multi-task settings, evaluated using two different pre-trained models. Compared to existing pre-trained models, our adaptation method improves the average success rate by over 7% across multiple tasks on both simulated benchmarks and real-world evaluations.
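The abstract names a human-robot contrastive alignment loss but does not give its form. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style objective over a batch of paired human and robot video embeddings, where the i-th human clip and the i-th robot clip show the same task. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # computed with a numerically stable log-sum-exp.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(human_emb, robot_emb, temperature=0.07):
    # Hypothetical symmetric InfoNCE loss: pair i (human, robot) is the
    # positive, every other pairing in the batch is a negative.
    h = [_normalize(v) for v in human_emb]
    r = [_normalize(v) for v in robot_emb]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(hv, rv)) / temperature for rv in r]
              for hv in h]
    transposed = [list(col) for col in zip(*logits)]
    # Average the human-to-robot and robot-to-human directions.
    return 0.5 * (_cross_entropy_diagonal(logits)
                  + _cross_entropy_diagonal(transposed))


# Toy embeddings (made up for illustration): three paired clips.
human = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
robot_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotating the robot batch breaks the pairing.
robot_shuffled = robot_aligned[1:] + robot_aligned[:1]

aligned_loss = contrastive_alignment_loss(human, robot_aligned)
shuffled_loss = contrastive_alignment_loss(human, robot_shuffled)
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is lower when each human embedding sits closest to its paired robot embedding, which is the behaviour an alignment objective of this kind is meant to reward. In the parameter-efficient setting the abstract describes, such a loss would presumably update only a small set of added parameters while the pre-trained backbone stays frozen, though the abstract does not detail that mechanism.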

Cite

Text

Zhou et al. "Mitigating the Human-Robot Domain Discrepancy in Visual Pre-Training for Robotic Manipulation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02100

Markdown

[Zhou et al. "Mitigating the Human-Robot Domain Discrepancy in Visual Pre-Training for Robotic Manipulation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhou2025cvpr-mitigating/) doi:10.1109/CVPR52734.2025.02100

BibTeX

@inproceedings{zhou2025cvpr-mitigating,
  title     = {{Mitigating the Human-Robot Domain Discrepancy in Visual Pre-Training for Robotic Manipulation}},
  author    = {Zhou, Jiaming and Ma, Teli and Lin, Kun-Yu and Wang, Zifan and Qiu, Ronghe and Liang, Junwei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {22551--22561},
  doi       = {10.1109/CVPR52734.2025.02100},
  url       = {https://mlanthology.org/cvpr/2025/zhou2025cvpr-mitigating/}
}