R3M: A Universal Visual Representation for Robot Manipulation
Abstract
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty that encourages sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.
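To make the pre-training objective concrete, below is a minimal PyTorch sketch of the three loss terms named in the abstract: a time-contrastive term, a video-language alignment term, and an L1 sparsity penalty on the embeddings. The function names, the `encoder` and `score_fn` modules, the negative-sampling scheme, and the loss weights are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Sketch of the R3M-style pre-training objective (illustrative, not official code).
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_anchor, z_pos, z_neg, temperature=0.1):
    """InfoNCE over frame embeddings: a temporally close frame (z_pos) should be
    nearer to the anchor than a temporally distant / other-video frame (z_neg)."""
    sim_pos = -torch.norm(z_anchor - z_pos, dim=-1) / temperature
    sim_neg = -torch.norm(z_anchor - z_neg, dim=-1) / temperature
    logits = torch.stack([sim_pos, sim_neg], dim=-1)              # (B, 2)
    labels = logits.new_zeros(len(logits), dtype=torch.long)      # positive = index 0
    return F.cross_entropy(logits, labels)

def video_language_alignment_loss(score_fn, z0, z_early, z_late, lang_emb):
    """A learned scorer should rate later frames (z_late) as more progress toward
    the clip's language annotation than earlier frames (z_early), given start z0."""
    s_late = score_fn(z0, z_late, lang_emb)                       # (B,)
    s_early = score_fn(z0, z_early, lang_emb)                     # (B,)
    logits = torch.stack([s_late, s_early], dim=-1)
    labels = logits.new_zeros(len(logits), dtype=torch.long)
    return F.cross_entropy(logits, labels)

def r3m_pretraining_loss(encoder, score_fn, frames, lang_emb, l1_weight=1e-5):
    """frames: (B, 4, C, H, W) = [start, t, t+k, distant negative] per video clip."""
    B = frames.shape[0]
    z = encoder(frames.flatten(0, 1)).view(B, 4, -1)              # (B, 4, D)
    z0, zt, ztk, zneg = z.unbind(dim=1)
    tcn = time_contrastive_loss(zt, ztk, zneg)
    lang = video_language_alignment_loss(score_fn, z0, zt, ztk, lang_emb)
    sparsity = z.abs().mean()                                     # L1 penalty on embeddings
    return tcn + lang + l1_weight * sparsity
```

For downstream policy learning, the pre-trained encoder is kept frozen and its features are fed to a small policy head trained, e.g., with behavior cloning on the available demonstrations; the authors release the representation in a standalone `r3m` package for this purpose.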
Cite

Text

Nair et al. "R3M: A Universal Visual Representation for Robot Manipulation." Conference on Robot Learning, 2022.

Markdown

[Nair et al. "R3M: A Universal Visual Representation for Robot Manipulation." Conference on Robot Learning, 2022.](https://mlanthology.org/corl/2022/nair2022corl-r3m/)

BibTeX
@inproceedings{nair2022corl-r3m,
  title     = {{R3M: A Universal Visual Representation for Robot Manipulation}},
  author    = {Nair, Suraj and Rajeswaran, Aravind and Kumar, Vikash and Finn, Chelsea and Gupta, Abhinav},
  booktitle = {Conference on Robot Learning},
  year      = {2022},
  pages     = {892--909},
  volume    = {205},
  url       = {https://mlanthology.org/corl/2022/nair2022corl-r3m/}
}