Thoughts and Lessons on Using Visual Foundation Models for Manipulation
Abstract
Training vision-based robotic systems from scratch is both computationally expensive and memory intensive. To mitigate these costs, recent approaches forgo end-to-end training in favor of adopting representations from visual foundation models -- large-scale models designed for broad task transferability. Recent years have seen numerous vision foundation models emerge, including several designed specifically for manipulation tasks. However, we still lack clear principles for what makes these models effective in robotics. To address this gap, we systematically evaluate vision foundation models for offline robotic learning. Across eleven diverse vision encoders, we find that a representation's ability to reconstruct edges and predict keypoints strongly correlates with its performance on manipulation tasks. Extensive correlation analysis across 21 manipulation tasks consistently shows that representations preserving edge and keypoint information achieve the highest environment success rates. These findings appear to challenge conventional wisdom about holistic reconstruction-based pretraining and offer a new lens for understanding what makes visual representations effective for robotics.
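To make the abstract's correlation analysis concrete, the sketch below shows one plausible way to probe a frozen vision encoder for keypoint prediction and then correlate probe quality with manipulation success across encoders. This is a minimal illustration, not the authors' code: the `KeypointProbe` class, the `probe_error` helper, and the per-encoder numbers are all hypothetical.

```python
# Minimal sketch (not the paper's implementation): linear keypoint probing on a
# frozen encoder, then a rank correlation between probe error and task success.
import torch
import torch.nn as nn
from scipy.stats import spearmanr


class KeypointProbe(nn.Module):
    """Linear probe mapping frozen encoder features to 2D keypoint coordinates."""

    def __init__(self, feat_dim: int, num_keypoints: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_keypoints * 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) pooled features from a frozen encoder
        return self.head(feats).view(feats.shape[0], -1, 2)


def probe_error(encoder, probe, images, keypoints):
    """Mean L2 keypoint error with the encoder held frozen (no gradients)."""
    with torch.no_grad():
        feats = encoder(images)          # (B, feat_dim); encoder stays frozen
    preds = probe(feats)                 # (B, K, 2) predicted keypoints
    return (preds - keypoints).norm(dim=-1).mean().item()


# Hypothetical per-encoder metrics purely for illustration:
# lower keypoint error is expected to track higher manipulation success.
keypoint_err = {"dino": 3.1, "mae": 5.4, "clip": 4.8, "r3m": 3.6}
success_rate = {"dino": 0.71, "mae": 0.42, "clip": 0.49, "r3m": 0.66}

encoders = sorted(keypoint_err)
rho, p = spearmanr([keypoint_err[e] for e in encoders],
                   [success_rate[e] for e in encoders])
print(f"Spearman rho (keypoint error vs. success): {rho:.2f} (p={p:.3f})")
```

Under these assumed numbers the rank correlation is strongly negative, i.e. encoders whose features support accurate keypoint prediction achieve higher success rates, which is the kind of relationship the abstract reports across 21 tasks.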
Cite

Chen et al. "Thoughts and Lessons on Using Visual Foundation Models for Manipulation." Transactions on Machine Learning Research, 2025. https://mlanthology.org/tmlr/2025/chen2025tmlr-thoughts/

BibTeX
@article{chen2025tmlr-thoughts,
  title   = {{Thoughts and Lessons on Using Visual Foundation Models for Manipulation}},
  author  = {Chen, Ryan and Pang, Ziteng and Stadie, Bradly C.},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/chen2025tmlr-thoughts/}
}