STEVE-1: A Generative Model for Text-to-Behavior in Minecraft (Abridged Version)
Abstract
Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL·E 2, is also effective for creating instruction-following sequential decision-making agents. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
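The unCLIP recipe the abstract alludes to has two pieces: a prior that translates a MineCLIP text embedding into a visual goal embedding, and a goal-conditioned VPT policy that turns pixel observations plus that goal into low-level actions. The PyTorch sketch below illustrates this pipeline under stated assumptions; the module names (`PriorNet`, `GoalConditionedPolicy`), dimensions, and the simple MLP prior are illustrative stand-ins, not the released STEVE-1 or MineCLIP API.

```python
# Hypothetical sketch of STEVE-1's unCLIP-style inference pipeline.
# All module names and sizes are illustrative assumptions, not the
# actual STEVE-1 / MineCLIP code.
import torch
import torch.nn as nn

EMBED_DIM = 512  # assumed MineCLIP embedding width


class PriorNet(nn.Module):
    """Translates a MineCLIP *text* embedding into a plausible *visual*
    goal embedding, mirroring the prior step in DALL·E 2's unCLIP."""

    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)


class GoalConditionedPolicy(nn.Module):
    """Stand-in for the instruction-tuned VPT policy: maps visual
    features plus a goal embedding to keyboard/mouse action logits."""

    def __init__(self, obs_dim: int, dim: int = EMBED_DIM, n_actions: int = 128):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, dim)
        self.head = nn.Linear(2 * dim, n_actions)

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        h = torch.cat([self.obs_proj(obs), goal], dim=-1)
        return self.head(h)  # logits over low-level actions


# Inference: text instruction -> text embedding -> prior -> goal -> actions.
prior = PriorNet()
policy = GoalConditionedPolicy(obs_dim=1024)
text_emb = torch.randn(1, EMBED_DIM)  # would come from MineCLIP's text encoder
goal_emb = prior(text_emb)            # unCLIP-style translation step
obs = torch.randn(1, 1024)            # visual features for one timestep
action_logits = policy(obs, goal_emb)
```

Because the policy is conditioned on a *visual* goal embedding, the same agent can also follow visual instructions directly by skipping the prior and embedding a goal image or clip with MineCLIP instead.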
Cite
Text
Lifshitz et al. "STEVE-1: A Generative Model for Text-to-Behavior in Minecraft (Abridged Version)." NeurIPS 2023 Workshops: GCRL, 2023.

Markdown
[Lifshitz et al. "STEVE-1: A Generative Model for Text-to-Behavior in Minecraft (Abridged Version)." NeurIPS 2023 Workshops: GCRL, 2023.](https://mlanthology.org/neuripsw/2023/lifshitz2023neuripsw-steve1-a/)

BibTeX
@inproceedings{lifshitz2023neuripsw-steve1-a,
  title = {{STEVE-1: A Generative Model for Text-to-Behavior in Minecraft (Abridged Version)}},
  author = {Lifshitz, Shalev and Paster, Keiran and Chan, Harris and Ba, Jimmy and McIlraith, Sheila},
  booktitle = {NeurIPS 2023 Workshops: GCRL},
  year = {2023},
  url = {https://mlanthology.org/neuripsw/2023/lifshitz2023neuripsw-steve1-a/}
}