Vid2World: Crafting Video Diffusion Models to Interactive World Models

Abstract

World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present _Vid2World_, a general approach for leveraging pre-trained video diffusion models and transferring them into interactive world models. To bridge the gap, Vid2World systematically explores _video diffusion causalization_, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a _causal action guidance_ mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
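
The abstract names two mechanisms, causalization and action guidance, without giving their formulas. As a rough illustration only, the sketch below assumes that causalization involves a lower-triangular temporal attention mask (each frame attends only to itself and earlier frames) and that action guidance resembles classifier-free guidance (blending an action-conditioned and an action-dropped prediction). Both assumptions and all function names are hypothetical and are not taken from the paper.

```python
def causal_temporal_mask(num_frames):
    """Assumed form of a causal mask over frames.

    mask[i][j] is True when frame i may attend to frame j, i.e. j <= i.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def guided_prediction(cond_pred, uncond_pred, guidance_scale):
    """Classifier-free-guidance-style blend (an assumption, not the paper's rule).

    cond_pred:   denoiser output given the action
    uncond_pred: denoiser output with the action dropped
    A guidance_scale of 1.0 returns cond_pred unchanged; larger values push
    the result further toward the action-conditioned prediction.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


mask = causal_temporal_mask(3)
blended = guided_prediction([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0)
```

Here `mask` lets the third frame see all three frames while the first sees only itself, which is the property autoregressive rollout relies on.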

Cite

Text

Huang et al. "Vid2World: Crafting Video Diffusion Models to Interactive World Models." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "Vid2World: Crafting Video Diffusion Models to Interactive World Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-vid2world/)

BibTeX

@inproceedings{huang2026iclr-vid2world,
  title     = {{Vid2World: Crafting Video Diffusion Models to Interactive World Models}},
  author    = {Huang, Siqiao and Wu, Jialong and Zhou, Qixing and Miao, Shangchen and Long, Mingsheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-vid2world/}
}