DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
Abstract
Spatio-temporal consistency is a critical topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or their basic combination, such as appending a camera-movement description after a prompt without constraining its outcomes. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, and the long-term impact of prior content on subsequent generation. Our research encompasses dataset construction through to model development. Initially, we constructed DropletVideo-10M, which comprises 10 million videos that feature dynamic camera motion and object actions. With an average length of 206 words, the captions offer detailed accounts of camera movements. Following this, we developed the DropletVideo model, which excels in preserving spatio-temporal coherence during video generation. The work has been open-sourced: https://dropletx.github.io/.
Cite
Text
Zhang et al. "DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation." International Conference on Computer Vision, 2025.
Markdown
[Zhang et al. "DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhang2025iccv-dropletvideo/)
BibTeX
@inproceedings{zhang2025iccv-dropletvideo,
title = {{DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation}},
author = {Zhang, Runze and Du, Guoguang and Li, Xiaochuan and Jia, Qi and Jin, Liang and Liu, Lu and Wang, Jingjing and Xu, Cong and Guo, Zhenhua and Zhao, Yaqian and Gong, Xiaoli and Li, Rengang and Fan, Baoyu},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {15583-15593},
url = {https://mlanthology.org/iccv/2025/zhang2025iccv-dropletvideo/}
}