Vid2World: Crafting Video Diffusion Models to Interactive World Models

Huang, Siqiao; Wu, Jialong; Zhou, Qixing; Miao, Shangchen; Long, Mingsheng

ICLR 2026

Abstract

World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present _Vid2World_, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores _video diffusion causalization_, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a _causal action guidance_ mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.

Cite

Text

Huang et al. "Vid2World: Crafting Video Diffusion Models to Interactive World Models." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "Vid2World: Crafting Video Diffusion Models to Interactive World Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-vid2world/)

BibTeX

@inproceedings{huang2026iclr-vid2world,
  title     = {{Vid2World: Crafting Video Diffusion Models to Interactive World Models}},
  author    = {Huang, Siqiao and Wu, Jialong and Zhou, Qixing and Miao, Shangchen and Long, Mingsheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-vid2world/}
}