iVideoGPT: Interactive VideoGPTs Are Scalable World Models

Abstract

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, enabling agents to interact with the model through next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that can be adapted to serve as an interactive world model for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
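To make the interleaved-token formulation above concrete, the following is a minimal sketch of how per-step observations, actions, and rewards might be flattened into a single sequence and modeled with next-token prediction. It is not the authors' implementation: the shared-vocabulary layout, the fixed per-frame token budget (the paper's compressive tokenization is not modeled), and all names such as `interleave` and `TinyWorldModel` are illustrative assumptions.

```python
# A minimal, self-contained PyTorch sketch, NOT the official iVideoGPT code.
# It only illustrates flattening (observation, action, reward) steps into one
# token sequence and training a causal transformer with next-token prediction.
# Vocabulary sizes, offsets, the per-frame budget, and all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VISUAL_VOCAB = 8192    # assumed codebook size of a visual tokenizer
ACTION_VOCAB = 256     # assumed number of discretized action bins
REWARD_VOCAB = 2       # assumed binary reward token
VOCAB = VISUAL_VOCAB + ACTION_VOCAB + REWARD_VOCAB
TOKENS_PER_FRAME = 16  # fixed budget here; compressive tokenization not modeled

def interleave(frame_tokens, action_ids, reward_ids):
    """Flatten per-step (observation, action, reward) ids into one sequence.

    frame_tokens: (T, TOKENS_PER_FRAME) visual token ids in [0, VISUAL_VOCAB)
    action_ids:   (T,) action ids, offset into the shared vocabulary
    reward_ids:   (T,) reward ids, offset into the shared vocabulary
    """
    chunks = []
    for t in range(frame_tokens.shape[0]):
        chunks.append(frame_tokens[t])
        chunks.append((VISUAL_VOCAB + action_ids[t]).view(1))
        chunks.append((VISUAL_VOCAB + ACTION_VOCAB + reward_ids[t]).view(1))
    return torch.cat(chunks)  # shape: (T * (TOKENS_PER_FRAME + 2),)

class TinyWorldModel(nn.Module):
    """A toy GPT-style decoder over the interleaved multimodal sequence."""
    def __init__(self, d_model=128, n_layers=2, n_heads=4, max_len=1024):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):  # tokens: (B, L) integer ids
        L = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(L, device=tokens.device))
        # Causal mask so each position only attends to earlier tokens.
        mask = torch.triu(torch.full((L, L), float("-inf"),
                                     device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=mask))  # (B, L, VOCAB)

# Toy usage: a 4-step random trajectory, trained to predict every next token.
frames = torch.randint(0, VISUAL_VOCAB, (4, TOKENS_PER_FRAME))
actions = torch.randint(0, ACTION_VOCAB, (4,))
rewards = torch.randint(0, REWARD_VOCAB, (4,))
seq = interleave(frames, actions, rewards).unsqueeze(0)   # (1, L)
logits = TinyWorldModel()(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
```

Under these assumptions, placing all modalities in one shared vocabulary lets a standard decoder-only transformer roll out imagined trajectories interactively: the agent supplies its action tokens, and the model samples the observation and reward tokens that follow.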

Cite

Text

Wu et al. "iVideoGPT: Interactive VideoGPTs Are Scalable World Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-2173

Markdown

[Wu et al. "iVideoGPT: Interactive VideoGPTs Are Scalable World Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/wu2024neurips-ivideogpt/) doi:10.52202/079017-2173

BibTeX

@inproceedings{wu2024neurips-ivideogpt,
  title     = {{iVideoGPT: Interactive VideoGPTs Are Scalable World Models}},
  author    = {Wu, Jialong and Yin, Shaofeng and Feng, Ningya and He, Xu and Li, Dong and Hao, Jianye and Long, Mingsheng},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2173},
  url       = {https://mlanthology.org/neurips/2024/wu2024neurips-ivideogpt/}
}