LongDiff: Training-Free Long Video Generation in One Go

Abstract

Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are designed and trained primarily for short video generation, which makes it difficult to maintain temporal consistency and visual detail when generating long videos. In this paper, through a theoretical analysis of the mechanisms behind video generation, we identify two key challenges that hinder short-to-long generalization, namely temporal position ambiguity and information dilution. To address these challenges, we propose LongDiff, a novel training-free method that unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.
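
The notion of temporal position ambiguity from the abstract can be made concrete with a small, hypothetical sketch: a model trained only on short clips never sees temporal position indices beyond its training length, so naively sampling a longer video feeds it positions it cannot disambiguate. The snippet below is a generic illustration of this effect and of one possible training-free remapping workaround; it is not the mechanism proposed in LongDiff, and the lengths and function names are assumptions for illustration only.

```python
# Hypothetical illustration (not the paper's actual mechanism): a toy look at
# "temporal position ambiguity". A model trained on clips of TRAIN_LEN frames
# only ever saw temporal position indices 0..TRAIN_LEN-1; sampling a longer
# video naively feeds it indices it has never encountered. One generic,
# training-free workaround is to squeeze the longer index range back into the
# trained range. All names and lengths below are illustrative placeholders.

TRAIN_LEN = 16   # frames per clip seen during training (assumed)
TARGET_LEN = 64  # frames requested at inference time (assumed)

def naive_positions(num_frames: int) -> list[int]:
    """Positions a short-video model receives if simply run for more frames."""
    return list(range(num_frames))  # indices >= TRAIN_LEN were never trained on

def remapped_positions(num_frames: int, train_len: int) -> list[int]:
    """Linearly compress the long index range into the trained range."""
    scale = (train_len - 1) / max(num_frames - 1, 1)
    return [round(i * scale) for i in range(num_frames)]

if __name__ == "__main__":
    unseen = [p for p in naive_positions(TARGET_LEN) if p >= TRAIN_LEN]
    print(f"naive extension: {len(unseen)} of {TARGET_LEN} positions are out of range")
    print("remapped positions stay within the trained range:",
          max(remapped_positions(TARGET_LEN, TRAIN_LEN)) < TRAIN_LEN)
```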

Cite

Text

Li et al. "LongDiff: Training-Free Long Video Generation in One Go." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01657

Markdown

[Li et al. "LongDiff: Training-Free Long Video Generation in One Go." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-longdiff/) doi:10.1109/CVPR52734.2025.01657

BibTeX

@inproceedings{li2025cvpr-longdiff,
  title     = {{LongDiff: Training-Free Long Video Generation in One Go}},
  author    = {Li, Zhuoling and Rahmani, Hossein and Ke, Qiuhong and Liu, Jun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {17789--17798},
  doi       = {10.1109/CVPR52734.2025.01657},
  url       = {https://mlanthology.org/cvpr/2025/li2025cvpr-longdiff/}
}