Grounding Multimodal Large Language Models in Actions

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, including both continuous and discrete actions. For continuous actions, a set of learned tokenizations that capture an action at various resolutions allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action grounding approaches on five different environments, encompassing over 114 embodied tasks.
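
The sketch below illustrates, under stated assumptions, the general idea behind the two grounding strategies the abstract summarizes: a residual, multi-resolution tokenizer for continuous actions and a mapping from discrete actions onto the MLLM's own language tokens. It is not the authors' implementation; class and function names are hypothetical, and the codebooks are random placeholders standing in for learned ones.

# Minimal sketch of the two action-grounding strategies described above.
# Assumptions: codebooks are random stand-ins (in practice they would be
# learned, e.g. with a VQ-VAE-style objective), and `tokenize` is whatever
# tokenizer the host MLLM provides.
import numpy as np


class ResidualActionTokenizer:
    """Quantize a continuous action with a stack of codebooks, where each
    level encodes the residual left by the previous one, so later levels
    refine the action at progressively finer resolution."""

    def __init__(self, action_dim: int, num_levels: int = 3,
                 codebook_size: int = 256, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One codebook per resolution level; finer levels use smaller codes.
        self.codebooks = [
            rng.normal(size=(codebook_size, action_dim)) * 0.5 ** level
            for level in range(num_levels)
        ]

    def encode(self, action: np.ndarray) -> list[int]:
        """Return one codebook index per level; an MLLM would emit these
        indices as newly added vocabulary tokens."""
        residual = action.astype(np.float64)
        indices = []
        for codebook in self.codebooks:
            distances = np.linalg.norm(codebook - residual, axis=1)
            idx = int(np.argmin(distances))
            indices.append(idx)
            residual = residual - codebook[idx]
        return indices

    def decode(self, indices: list[int]) -> np.ndarray:
        """Sum the selected code vectors to reconstruct the action."""
        return sum(cb[idx] for cb, idx in zip(self.codebooks, indices))


def semantic_discrete_actions(action_names: list[str], tokenize) -> dict[str, list[int]]:
    """For discrete action spaces, map each action to the MLLM's own tokens
    for its natural-language name (e.g. "pick up", "turn left"), so the
    policy reuses the model's existing vocabulary instead of opaque IDs."""
    return {name: tokenize(name) for name in action_names}


if __name__ == "__main__":
    tokenizer = ResidualActionTokenizer(action_dim=7)
    action = np.random.default_rng(1).uniform(-1, 1, size=7)
    tokens = tokenizer.encode(action)
    recon = tokenizer.decode(tokens)
    print("codebook indices:", tokens)
    print("reconstruction error:", float(np.linalg.norm(action - recon)))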

Cite

Text

Szot et al. "Grounding Multimodal Large Language Models in Actions." Neural Information Processing Systems, 2024. doi:10.52202/079017-0638

Markdown

[Szot et al. "Grounding Multimodal Large Language Models in Actions." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/szot2024neurips-grounding/) doi:10.52202/079017-0638

BibTeX

@inproceedings{szot2024neurips-grounding,
  title     = {{Grounding Multimodal Large Language Models in Actions}},
  author    = {Szot, Andrew and Mazoure, Bogdan and Agrawal, Harsh and Hjelm, Devon and Kira, Zsolt and Toshev, Alexander},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0638},
  url       = {https://mlanthology.org/neurips/2024/szot2024neurips-grounding/}
}