PaLM-E: An Embodied Multimodal Language Model
Abstract
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multimodal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
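The "multimodal sentences" described above can be illustrated with a minimal sketch: continuous sensor features (e.g. image embeddings) are mapped by a learned projection into the language model's word-embedding space and interleaved with ordinary text-token embeddings. All dimensions, names, and the random embedding table below are hypothetical stand-ins for illustration, not the paper's actual architecture or sizes.

```python
import numpy as np

# Hypothetical sizes; PaLM-E's actual backbone is far larger.
D_MODEL = 8    # language-model embedding width
D_VISION = 4   # raw visual feature width

rng = np.random.default_rng(0)

# A learned linear projection maps continuous sensor features
# into the same space as word embeddings (sketch: random weights).
W_proj = rng.normal(size=(D_VISION, D_MODEL))

# Stand-in word-embedding table (random, for illustration only).
EMBED_TABLE = rng.normal(size=(100, D_MODEL))

def embed_text(token_ids):
    """Look up text-token embeddings."""
    return EMBED_TABLE[token_ids]

def build_multimodal_sentence(prefix_tokens, image_features, suffix_tokens):
    """Interleave projected visual features between text embeddings,
    yielding one sequence the language model consumes end-to-end."""
    return np.concatenate([
        embed_text(prefix_tokens),   # text before the image slot
        image_features @ W_proj,     # visual features as "soft tokens"
        embed_text(suffix_tokens),   # text after the image slot
    ], axis=0)

img = rng.normal(size=(3, D_VISION))            # 3 visual tokens
seq = build_multimodal_sentence([5, 17], img, [42])
print(seq.shape)  # (2 + 3 + 1, D_MODEL) -> (6, 8)
```

Training end-to-end then means gradients from the language-modeling loss flow back through `W_proj` (and optionally the vision encoder), which is how the encodings are learned jointly with the pre-trained LLM.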
Cite
Text
Driess et al. "PaLM-E: An Embodied Multimodal Language Model." International Conference on Machine Learning, 2023.
Markdown
[Driess et al. "PaLM-E: An Embodied Multimodal Language Model." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/driess2023icml-palme/)
BibTeX
@inproceedings{driess2023icml-palme,
title = {{PaLM-E: An Embodied Multimodal Language Model}},
author = {Driess, Danny and Xia, Fei and Sajjadi, Mehdi S. M. and Lynch, Corey and Chowdhery, Aakanksha and Ichter, Brian and Wahid, Ayzaan and Tompson, Jonathan and Vuong, Quan and Yu, Tianhe and Huang, Wenlong and Chebotar, Yevgen and Sermanet, Pierre and Duckworth, Daniel and Levine, Sergey and Vanhoucke, Vincent and Hausman, Karol and Toussaint, Marc and Greff, Klaus and Zeng, Andy and Mordatch, Igor and Florence, Pete},
booktitle = {International Conference on Machine Learning},
year = {2023},
pages = {8469-8488},
volume = {202},
url = {https://mlanthology.org/icml/2023/driess2023icml-palme/}
}