Track2Act: Predicting Point Tracks from Internet Videos Enables Generalizable Robot Manipulation

Abstract

We seek to learn a generalizable goal-conditioned policy that enables diverse robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on large amounts of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos from the web, including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed-loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables diverse generalizable robot manipulation, and we present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes.

Project page: https://homangab.github.io/track2act/

∗ Equal contribution. Correspondence to Homanga B.: [email protected]
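To make the track-to-transform step concrete, here is a minimal sketch (not the authors' exact implementation): assuming the predicted 2D point tracks can be lifted to 3D using a depth map and pinhole camera intrinsics, the per-step rigid transform of the manipulated object can be recovered in closed form with the standard Kabsch/orthogonal-Procrustes solution. The function names and the depth-based lifting are illustrative assumptions.

```python
import numpy as np

def lift_tracks(tracks_2d, depth, K):
    """Back-project 2D track points (N, 2) into camera-frame 3D points,
    assuming a depth map and a 3x3 pinhole intrinsics matrix K."""
    u, v = tracks_2d[:, 0], tracks_2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]   # depth indexed as [row, col]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def rigid_transform_from_tracks(pts_t, pts_t1):
    """Least-squares SE(3) transform (R, t) mapping 3D points at time t
    (N, 3) onto the same tracked points at time t+1, via Kabsch."""
    c_t, c_t1 = pts_t.mean(0), pts_t1.mean(0)      # centroids
    H = (pts_t - c_t).T @ (pts_t1 - c_t1)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_t1 - R @ c_t
    return R, t
```

Chaining these per-step transforms and applying them to the end-effector's grasp pose gives the open-loop plan described above, which the closed-loop residual policy then refines.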

Cite

Text

Bharadhwaj et al. "Track2Act: Predicting Point Tracks from Internet Videos Enables Generalizable Robot Manipulation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73116-7_18

Markdown

[Bharadhwaj et al. "Track2Act: Predicting Point Tracks from Internet Videos Enables Generalizable Robot Manipulation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/bharadhwaj2024eccv-track2act/) doi:10.1007/978-3-031-73116-7_18

BibTeX

@inproceedings{bharadhwaj2024eccv-track2act,
  title     = {{Track2Act: Predicting Point Tracks from Internet Videos Enables Generalizable Robot Manipulation}},
  author    = {Bharadhwaj, Homanga and Mottaghi, Roozbeh and Gupta, Abhinav and Tulsiani, Shubham},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73116-7_18},
  url       = {https://mlanthology.org/eccv/2024/bharadhwaj2024eccv-track2act/}
}