End-to-End Learning of Semantic Grasping

Abstract

We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A "ventral stream" recognizes object class while a "dorsal stream" simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach improves upon grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance.
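To make the two-stream architecture described in the abstract concrete, the sketch below gives a minimal, illustrative Python (PyTorch) model: a shared convolutional trunk feeds a "dorsal" head that scores a candidate grasp action for success and a "ventral" head that classifies the object being grasped. This is an assumption-laden illustration, not the authors' implementation; the module names (TwoStreamGraspNet, DorsalStream-style heads), the 4-dimensional grasp-action encoding, and the layer sizes are all hypothetical.

# Illustrative sketch of a two-stream semantic grasping model (assumed
# structure; not the authors' released code). Both streams see the same
# monocular image; the dorsal head scores grasp success for a candidate
# action, while the ventral head predicts the class of the grasped object.
import torch
import torch.nn as nn

class TwoStreamGraspNet(nn.Module):
    def __init__(self, num_classes: int, action_dim: int = 4):
        super().__init__()
        # Shared convolutional trunk over the monocular RGB image.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Dorsal stream: image features + candidate grasp action -> grasp-success logit.
        self.dorsal = nn.Sequential(
            nn.Linear(64 + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
        # Ventral stream: image features -> class logits for the grasped object.
        self.ventral = nn.Linear(64, num_classes)

    def forward(self, image, action):
        feats = self.trunk(image)
        grasp_logit = self.dorsal(torch.cat([feats, action], dim=-1))
        class_logits = self.ventral(feats)
        return grasp_logit, class_logits

# Joint training sketch: self-supervised grasp-success labels supervise the
# dorsal head, while (possibly label-propagated) class labels supervise the
# ventral head. All quantities below are random placeholders.
model = TwoStreamGraspNet(num_classes=16)
image = torch.randn(8, 3, 64, 64)            # batch of monocular images
action = torch.randn(8, 4)                   # hypothetical grasp-action encoding
success = torch.randint(0, 2, (8, 1)).float()
labels = torch.randint(0, 16, (8,))
grasp_logit, class_logits = model(image, action)
loss = (nn.functional.binary_cross_entropy_with_logits(grasp_logit, success)
        + nn.functional.cross_entropy(class_logits, labels))
loss.backward()

The joint loss above mirrors the paper's end-to-end framing in spirit: grasp geometry and object identity are learned together from shared image features, though the actual training setup (auxiliary non-semantic grasping data and semantically labeled images without grasps) is richer than this sketch.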

Cite

Text

Jang et al. "End-to-End Learning of Semantic Grasping." Conference on Robot Learning, 2017.

Markdown

[Jang et al. "End-to-End Learning of Semantic Grasping." Conference on Robot Learning, 2017.](https://mlanthology.org/corl/2017/jang2017corl-end/)

BibTeX

@inproceedings{jang2017corl-end,
  title     = {{End-to-End Learning of Semantic Grasping}},
  author    = {Jang, Eric and Vijayanarasimhan, Sudheendra and Pastor, Peter and Ibarz, Julian and Levine, Sergey},
  booktitle = {Conference on Robot Learning},
  year      = {2017},
  pages     = {119--132},
  url       = {https://mlanthology.org/corl/2017/jang2017corl-end/}
}