Contrastive Sequential-Diffusion Learning: Non-Linear and Multi-Scene Instructional Video Synthesis

Abstract

Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, often exhibit non-linear patterns in which the next scene may need to be visually consistent not with the immediately preceding scene but with an earlier one. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent with respect to the scenes that require visual consistency. Experiments with real-world action-centric data demonstrate the practicality and improved consistency of our model compared to previous work. Code and examples are available at https://github.com/novasearch/CoSeD.
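The selection step described in the abstract can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation (see the repository above for that): it assumes a CLIP-style encoder scores each previously generated scene, represented by one frame, against the next scene's text description, and picks the best match via a contrastive softmax over the candidates. The function name `select_conditioning_scene` and the `temperature` parameter are hypothetical.

```python
# Minimal sketch of contrastive scene selection, assuming a CLIP-style
# encoder from the `transformers` library. Not the authors' code.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_conditioning_scene(next_description, prev_frames, temperature=1.0):
    """Score each previously generated scene (one representative PIL frame
    per scene) against the next scene's text description and return the
    index of the best match, with contrastive softmax scores over all
    candidates. `temperature` is an illustrative knob; CLIP's logits are
    already scaled internally."""
    inputs = processor(text=[next_description], images=prev_frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_prev_scenes): text-image similarity.
    scores = (out.logits_per_text / temperature).softmax(dim=-1)
    return scores.argmax(dim=-1).item(), scores
```

In the paper's pipeline, the selected scene would then condition the denoising of the next scene's video; the exact scoring and conditioning functions here are assumptions for illustration only.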

Cite

Text

Ramos et al. "Contrastive Sequential-Diffusion Learning: Non-Linear and Multi-Scene Instructional Video Synthesis." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Ramos et al. "Contrastive Sequential-Diffusion Learning: Non-Linear and Multi-Scene Instructional Video Synthesis." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/ramos2025wacv-contrastive/)

BibTeX

@inproceedings{ramos2025wacv-contrastive,
  title     = {{Contrastive Sequential-Diffusion Learning: Non-Linear and Multi-Scene Instructional Video Synthesis}},
  author    = {Ramos, Vasco and Bitton, Yonatan and Yarom, Michal and Szpektor, Idan and Magalhaes, Joao},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {4645--4654},
  url       = {https://mlanthology.org/wacv/2025/ramos2025wacv-contrastive/}
}