Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

Abstract

Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts in training Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.
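The idea of consolidating a variable-length observation-action history into fixed-length behavior tokens can be illustrated with a toy sketch. This is not the paper's implementation: the abstract gives no architectural details, so everything below is an assumption made for illustration. Simple concatenation (`fuse_step`) stands in for the per-timestep modeling of observations and actions, attention pooling with a fixed set of query vectors (`consolidate`) stands in for the consolidation step, and random vectors stand in for parameters that would be learned. All names and dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_step(obs, act):
    """Hypothetical per-timestep fusion: concatenate observation and action features."""
    return obs + act


def consolidate(history, queries):
    """Attention-pool a variable-length history into len(queries) tokens.

    Each query attends over every timestep, so the number of output tokens
    depends only on the number of queries, not on the history length.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for query in queries:
        weights = softmax([dot(query, step) * scale for step in history])
        token = [sum(w * step[i] for w, step in zip(weights, history))
                 for i in range(dim)]
        tokens.append(token)
    return tokens


def make_behavior_tokens(observations, actions, queries):
    history = [fuse_step(o, a) for o, a in zip(observations, actions)]
    return consolidate(history, queries)


# Assumed toy sizes; random vectors stand in for learned queries and features.
rng = random.Random(0)
OBS_DIM, ACT_DIM, NUM_TOKENS = 4, 2, 3
queries = [[rng.gauss(0, 1) for _ in range(OBS_DIM + ACT_DIM)]
           for _ in range(NUM_TOKENS)]


def random_episode(length):
    obs = [[rng.gauss(0, 1) for _ in range(OBS_DIM)] for _ in range(length)]
    act = [[rng.gauss(0, 1) for _ in range(ACT_DIM)] for _ in range(length)]
    return obs, act


short_tokens = make_behavior_tokens(*random_episode(5), queries)
long_tokens = make_behavior_tokens(*random_episode(50), queries)
print(len(short_tokens), len(long_tokens))
```

Both the 5-step and the 50-step episode yield `NUM_TOKENS` tokens of the same width, which is the property that lets a downstream sequence model consume a history of any length at constant cost.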

Cite

Text

Li et al. "Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00845

Markdown

[Li et al. "Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-optimus2/) doi:10.1109/CVPR52734.2025.00845

BibTeX

@inproceedings{li2025cvpr-optimus2,
  title     = {{Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy}},
  author    = {Li, Zaijing and Xie, Yuquan and Shao, Rui and Chen, Gongwei and Jiang, Dongmei and Nie, Liqiang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {9039--9049},
  doi       = {10.1109/CVPR52734.2025.00845},
  url       = {https://mlanthology.org/cvpr/2025/li2025cvpr-optimus2/}
}