Real-World Robot Learning with Masked Visual Pre-Training

Abstract

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (by up to 75%), supervised ImageNet pre-training (by up to 81%), and training from scratch (by up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and clearly demonstrate the benefits of scaling visual pre-training for robot learning.
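
The pipeline the abstract describes (a frozen, pre-trained MAE visual encoder whose features feed a small learnable control module) can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' code: the encoder stub, the ControlHead module, the dimensions, and the behavior-cloning MSE loss are assumptions chosen for demonstration.

import torch
import torch.nn as nn

class ControlHead(nn.Module):
    """Learnable control module: maps frozen visual features
    (concatenated with proprioceptive state) to robot actions."""
    def __init__(self, feat_dim: int, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, feats: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([feats, state], dim=-1))

def make_policy(encoder: nn.Module, feat_dim: int, state_dim: int, action_dim: int):
    # Freeze the pre-trained visual encoder: no gradients, eval mode.
    encoder.requires_grad_(False)
    encoder.eval()
    head = ControlHead(feat_dim, state_dim, action_dim)
    # Only the control head's parameters are optimized.
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    return head, optimizer

if __name__ == "__main__":
    # Stand-in encoder so the sketch runs; in practice this would be
    # the pre-trained MAE vision transformer.
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
    head, opt = make_policy(encoder, feat_dim=768, state_dim=7, action_dim=7)
    images = torch.randn(4, 3, 224, 224)      # camera observations
    states = torch.randn(4, 7)                # proprioceptive states
    expert_actions = torch.randn(4, 7)        # demonstration actions
    with torch.no_grad():
        feats = encoder(images)               # frozen forward pass
    loss = nn.functional.mse_loss(head(feats, states), expert_actions)
    opt.zero_grad()
    loss.backward()                           # gradients flow only into the head
    opt.step()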

Cite

Text

Radosavovic et al. "Real-World Robot Learning with Masked Visual Pre-Training." Conference on Robot Learning, 2022.

Markdown

[Radosavovic et al. "Real-World Robot Learning with Masked Visual Pre-Training." Conference on Robot Learning, 2022.](https://mlanthology.org/corl/2022/radosavovic2022corl-realworld/)

BibTeX

@inproceedings{radosavovic2022corl-realworld,
  title     = {{Real-World Robot Learning with Masked Visual Pre-Training}},
  author    = {Radosavovic, Ilija and Xiao, Tete and James, Stephen and Abbeel, Pieter and Malik, Jitendra and Darrell, Trevor},
  booktitle = {Conference on Robot Learning},
  year      = {2022},
  pages     = {416--426},
  volume    = {205},
  url       = {https://mlanthology.org/corl/2022/radosavovic2022corl-realworld/}
}