iVideoGPT: Interactive VideoGPTs Are Scalable World Models
Abstract
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
Cite
Text
Wu et al. "iVideoGPT: Interactive VideoGPTs Are Scalable World Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-2173
Markdown
[Wu et al. "iVideoGPT: Interactive VideoGPTs Are Scalable World Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/wu2024neurips-ivideogpt/) doi:10.52202/079017-2173
BibTeX
@inproceedings{wu2024neurips-ivideogpt,
title = {{iVideoGPT: Interactive VideoGPTs Are Scalable World Models}},
author = {Wu, Jialong and Yin, Shaofeng and Feng, Ningya and He, Xu and Li, Dong and Hao, Jianye and Long, Mingsheng},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2173},
url = {https://mlanthology.org/neurips/2024/wu2024neurips-ivideogpt/}
}