Video-GPT via Next CLIP Diffusion

Abstract

GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe the spatial-temporal details of the visual world. Video sequences, by contrast, capture such details well. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip conditioned on the clean clips in the history. Extensive experiments show that our Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning both video generation and understanding, showing strong generalization capacity on downstream tasks.
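The idea of autoregressively denoising a noisy clip given the clean clips before it can be illustrated with a toy rollout loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: each clip is a flat list of floats standing in for pixel or latent values, `toy_denoiser` is a hypothetical placeholder for the learned network, and the simple interpolation sampler, along with the names `sample_next_clip` and `rollout`, is chosen here purely for illustration.

```python
import random


def toy_denoiser(noisy_clip, history, t):
    """Hypothetical stand-in for a learned network that predicts the clean clip.

    It simply repeats the most recent clean clip. A real model would be a
    trained network conditioned on the history and the noise level t.
    """
    return list(history[-1])


def sample_next_clip(history, clip_len, num_steps=10, rng=None):
    """Denoise one clip from pure noise, conditioned on clean history clips."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(clip_len)]  # start from noise
    for step in range(num_steps):
        t = 1.0 - step / num_steps  # noise level, decreasing from 1 toward 0
        pred = toy_denoiser(x, history, t)
        # Move a fraction of the way toward the predicted clean clip.
        # On the final step the fraction is 1, so x lands on the prediction.
        frac = 1.0 / (num_steps - step)
        x = [xi + frac * (pi - xi) for xi, pi in zip(x, pred)]
    return x


def rollout(first_clip, num_new_clips, num_steps=10, seed=0):
    """Autoregressive rollout: each generated clip joins the clean history."""
    rng = random.Random(seed)
    history = [list(first_clip)]
    for _ in range(num_new_clips):
        next_clip = sample_next_clip(history, len(first_clip), num_steps, rng)
        history.append(next_clip)
    return history


if __name__ == "__main__":
    clips = rollout([0.1, 0.2, 0.3, 0.4], num_new_clips=3)
    print(len(clips), "clips in the sequence")
```

In this sketch, long-term prediction corresponds to running `rollout` for many clips, while short-term generation corresponds to a single call to `sample_next_clip`; how the actual model handles either case is specified in the paper, not here.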

Cite

Text

Zhuang et al. "Video-GPT via Next CLIP Diffusion." International Conference on Learning Representations, 2026.

Markdown

[Zhuang et al. "Video-GPT via Next CLIP Diffusion." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhuang2026iclr-videogpt/)

BibTeX

@inproceedings{zhuang2026iclr-videogpt,
  title     = {{Video-GPT via Next CLIP Diffusion}},
  author    = {Zhuang, Shaobin and Huang, Zhipeng and Zhang, Ying and Wang, Fangyikang and Fu, Canmiao and Yang, Binxin and Sun, Chong and Li, Chen and Wang, Yali},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhuang2026iclr-videogpt/}
}