Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Abstract
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that jointly learns visual representations and action policies within a single video-generative framework. At its core, GE-Base is a large-scale instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Building on this foundation, GE-Act employs a lightweight flow-matching decoder to map latent representations into executable action trajectories, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. Trained on over 1 million manipulation episodes, GE supports both short- and long-horizon tasks, and generalizes across embodiments. All code, models, and benchmarks will be released publicly.
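As a rough illustration of what flow-matching decoding means in general, the sketch below Euler-integrates a velocity field that carries a noise sample toward an action vector. It is not GE-Act's implementation: `toy_velocity`, `sample_actions`, the three-dimensional action, and the step count are all hypothetical, and the velocity function is handed the target directly, whereas a learned decoder would predict the velocity from latent features.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that moves the current sample to the target by t = 1 is
    (target - x) / (1 - t). A real decoder would not see the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_actions(target, num_steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise (t=0) to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Assumed toy action vector, for illustration only.
target_action = [0.2, -0.5, 1.0]
actions = sample_actions(target_action)
```

With the exact straight-line velocity used here, the final Euler step lands on the target up to floating-point error; with a learned velocity field the result would only approximate it.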
Cite
Text
Liao et al. "Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation." International Conference on Learning Representations, 2026.
Markdown
[Liao et al. "Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liao2026iclr-genie/)
BibTeX
@inproceedings{liao2026iclr-genie,
title = {{Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation}},
author = {Liao, Yue and Zhou, Pengfei and Huang, Siyuan and Yang, Donglin and Chen, Shengcong and Jiang, Yuxin and Hu, Yue and Liu, Si and Luo, Jianlan and Chen, Liliang and Yan, Shuicheng and Yao, Maoqing and Ren, Guanghui},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/liao2026iclr-genie/}
}