Skill Transformer: A Monolithic Policy for Mobile Manipulation

Abstract

We present Skill Transformer, an approach for solving long-horizon robotic tasks by combining conditional sequence modeling and skill modularity. Conditioned on egocentric and proprioceptive observations of a robot, Skill Transformer is trained end-to-end to predict both a high-level skill (e.g., navigation, picking, placing) and a whole-body low-level action (e.g., base and arm motion), using a transformer architecture and demonstration trajectories that solve the full task. It retains the composability and modularity of the overall task through a skill predictor module while reasoning about low-level actions, avoiding the hand-off errors common in modular approaches. We test Skill Transformer on an embodied rearrangement benchmark and find it performs robust task planning and low-level control in new scenarios, achieving a 2.5x higher success rate than baselines in hard rearrangement problems.
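To make the two-headed design concrete, below is a minimal PyTorch sketch of a policy in this style: a causal transformer over per-step observation tokens, a skill-predictor head over a discrete skill set, and an action head for the continuous whole-body action, trained jointly by behavior cloning on full-task demonstrations. All module names, dimensions, the skill set, and the way the action head is conditioned on the skill are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

NUM_SKILLS = 3   # e.g., navigation, picking, placing (assumed skill set)
ACTION_DIM = 10  # whole-body action: base + arm motion (assumed size)

class SkillTransformerSketch(nn.Module):
    def __init__(self, visual_dim=576, proprio_dim=14, d_model=128,
                 n_layers=4, n_heads=4, context_len=32):
        super().__init__()
        # Fuse egocentric visual features and proprioception into one token per step.
        self.obs_proj = nn.Linear(visual_dim + proprio_dim, d_model)
        self.pos_emb = nn.Embedding(context_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Skill-predictor head: which skill is active at each step.
        self.skill_head = nn.Linear(d_model, NUM_SKILLS)
        # Action head, conditioned on the predicted skill distribution
        # (conditioning via the softmax over skills is an assumption).
        self.action_head = nn.Linear(d_model + NUM_SKILLS, ACTION_DIM)

    def forward(self, visual, proprio):
        # visual:  (B, T, visual_dim) pre-extracted egocentric features
        # proprio: (B, T, proprio_dim) joint positions, gripper state, etc.
        B, T, _ = visual.shape
        tokens = self.obs_proj(torch.cat([visual, proprio], dim=-1))
        tokens = tokens + self.pos_emb(torch.arange(T, device=tokens.device))
        # Causal mask so each step attends only to past observations.
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.backbone(tokens, mask=mask)
        skill_logits = self.skill_head(h)
        action = self.action_head(
            torch.cat([h, skill_logits.softmax(dim=-1)], dim=-1))
        return skill_logits, action

# End-to-end behavior cloning on demonstrations of the full task: a joint
# loss over the per-step skill label and the low-level expert action.
def bc_loss(model, visual, proprio, skill_labels, expert_actions):
    skill_logits, action = model(visual, proprio)
    skill_loss = nn.functional.cross_entropy(
        skill_logits.flatten(0, 1), skill_labels.flatten())
    action_loss = nn.functional.mse_loss(action, expert_actions)
    return skill_loss + action_loss

Because the skill head and action head share one backbone and one loss, the skill choice and the low-level control are optimized together rather than stitched from separately trained modules, which is the property the abstract credits for avoiding hand-off errors.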

Cite

Text

Huang et al. "Skill Transformer: A Monolithic Policy for Mobile Manipulation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00996

Markdown

[Huang et al. "Skill Transformer: A Monolithic Policy for Mobile Manipulation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/huang2023iccv-skill/) doi:10.1109/ICCV51070.2023.00996

BibTeX

@inproceedings{huang2023iccv-skill,
  title     = {{Skill Transformer: A Monolithic Policy for Mobile Manipulation}},
  author    = {Huang, Xiaoyu and Batra, Dhruv and Rai, Akshara and Szot, Andrew},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {10852--10862},
  doi       = {10.1109/ICCV51070.2023.00996},
  url       = {https://mlanthology.org/iccv/2023/huang2023iccv-skill/}
}