Phantom: Training Robots Without Robots Using Only Human Videos
Abstract
Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations, which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing. We inpaint the human arm and overlay a rendered robot to align the visual domains. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates of up to 92% on a range of tasks including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning. Videos are available at https://phantom-training-robots.github.io.
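To make the data-conversion pipeline concrete, below is a minimal Python sketch of the per-frame flow the abstract describes: estimate the hand pose, inpaint the human arm out of the frame, composite a rendered robot at the retargeted end-effector pose, and label the action with the pose change toward the next frame. All component functions (estimate_hand_pose, inpaint, render_robot, retarget) are hypothetical stubs standing in for real models; this illustrates the data flow only and is not the authors' implementation.

```python
import numpy as np

# --- Placeholder components (hypothetical stubs, not the paper's models) ---

def estimate_hand_pose(frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in for a hand pose estimator.
    Returns a 6-DoF wrist pose and a binary mask of the human arm."""
    pose = np.zeros(6)                      # [x, y, z, roll, pitch, yaw]
    mask = np.zeros(frame.shape[:2], bool)  # pixels covered by the arm
    return pose, mask

def inpaint(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stand-in for an inpainting model that removes the human arm."""
    out = frame.copy()
    out[mask] = 0  # a real model would fill in plausible background
    return out

def render_robot(frame: np.ndarray, ee_pose: np.ndarray) -> np.ndarray:
    """Stand-in for a renderer that overlays a robot arm at the given
    end-effector pose to align the visual domains."""
    return frame  # a real renderer would composite the robot here

def retarget(hand_pose: np.ndarray) -> np.ndarray:
    """Stand-in mapping from human wrist pose to robot end-effector pose."""
    return hand_pose

# --- One human video frame -> one robot observation-action pair ---

def convert(frame: np.ndarray, next_frame: np.ndarray):
    hand_pose, arm_mask = estimate_hand_pose(frame)
    ee_pose = retarget(hand_pose)

    # Edit the observation: remove the human arm, composite a rendered robot.
    obs = render_robot(inpaint(frame, arm_mask), ee_pose)

    # Label the action with the end-effector pose delta to the next frame.
    next_pose, _ = estimate_hand_pose(next_frame)
    action = retarget(next_pose) - ee_pose
    return obs, action

if __name__ == "__main__":
    f0 = np.zeros((224, 224, 3), np.uint8)
    f1 = np.zeros((224, 224, 3), np.uint8)
    obs, action = convert(f0, f1)
    print(obs.shape, action)
```

Applying this conversion to every consecutive frame pair of a human video yields robot-compatible observation-action trajectories that can be fed to a standard imitation-learning pipeline without any teleoperated robot data.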
Cite
Text
Lepert et al. "Phantom: Training Robots Without Robots Using Only Human Videos." Proceedings of The 9th Conference on Robot Learning, 2025.
Markdown
[Lepert et al. "Phantom: Training Robots Without Robots Using Only Human Videos." Proceedings of The 9th Conference on Robot Learning, 2025.](https://mlanthology.org/corl/2025/lepert2025corl-phantom/)
BibTeX
@inproceedings{lepert2025corl-phantom,
title = {{Phantom: Training Robots Without Robots Using Only Human Videos}},
author = {Lepert, Marion and Fang, Jiaying and Bohg, Jeannette},
booktitle = {Proceedings of The 9th Conference on Robot Learning},
year = {2025},
pages = {4545--4565},
volume = {305},
url = {https://mlanthology.org/corl/2025/lepert2025corl-phantom/}
}