VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

Abstract

Creating stable, controllable videos is a complex task because it demands significant variation in temporal dynamics alongside cross-frame temporal consistency. To address this, we enhance the spatial-temporal capabilities of diffusion models and introduce a versatile video generation model, VersVideo, which leverages textual, visual, and stylistic conditions. Current video diffusion models typically extend image diffusion architectures by supplementing 2D operations (such as convolutions and attention) with temporal operations. While this approach is efficient, it often restricts spatial-temporal performance because it over-simplifies standard 3D operations. To counter this, we incorporate two key elements: (1) multi-excitation paths for spatial-temporal convolutions with dimension pooling across different axes, and (2) multi-expert spatial-temporal attention blocks. These enhancements boost the model's spatial-temporal performance without significantly increasing training and inference costs. We also tackle the information loss that arises when a variational autoencoder transforms pixel space into latent features and back into pixel frames: we incorporate temporal modules into the decoder to maintain inter-frame consistency. Lastly, building on this enhanced denoising UNet and decoder, we develop a unified ControlNet model suitable for various conditions, including image, Canny, HED, depth, and style. Examples of the videos generated by our model can be found at https://jinxixiang.github.io/versvideo/.
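The abstract only names the building blocks, so the following is a minimal PyTorch sketch of one plausible reading of a "multi-excitation" spatial-temporal convolution with dimension pooling across different axes: a factorized 3D convolution whose output is gated by cheap excitation signals computed from the feature map pooled over different axes. Every class, module, and parameter name here is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiExcitationSTConv(nn.Module):
    """Hypothetical sketch (not the paper's code) of a multi-excitation
    spatial-temporal convolution on (B, C, T, H, W) video features."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Factorized spatial-temporal convolution: 2D spatial kernel, then 1D temporal kernel.
        self.spatial_conv = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        self.temporal_conv = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
        # One excitation path per pooled axis; 1x1x1 convs keep the gates cheap.
        self.excite_thw = nn.Sequential(  # pool over T,H,W -> per-channel gate
            nn.Conv3d(channels, hidden, 1), nn.SiLU(), nn.Conv3d(hidden, channels, 1)
        )
        self.excite_hw = nn.Sequential(   # pool over H,W -> per-(channel, frame) gate
            nn.Conv3d(channels, hidden, 1), nn.SiLU(), nn.Conv3d(hidden, channels, 1)
        )
        self.excite_t = nn.Sequential(    # pool over T -> per-(channel, pixel) gate
            nn.Conv3d(channels, hidden, 1), nn.SiLU(), nn.Conv3d(hidden, channels, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        h = self.temporal_conv(self.spatial_conv(x))
        g_thw = self.excite_thw(x.mean(dim=(2, 3, 4), keepdim=True))  # (B, C, 1, 1, 1)
        g_hw = self.excite_hw(x.mean(dim=(3, 4), keepdim=True))       # (B, C, T, 1, 1)
        g_t = self.excite_t(x.mean(dim=2, keepdim=True))              # (B, C, 1, H, W)
        gate = torch.sigmoid(g_thw + g_hw + g_t)  # broadcasts to (B, C, T, H, W)
        return h * gate


if __name__ == "__main__":
    x = torch.randn(2, 64, 8, 32, 32)  # (batch, channels, frames, height, width)
    print(MultiExcitationSTConv(64)(x).shape)  # torch.Size([2, 64, 8, 32, 32])
```

The appeal of this pattern matches the abstract's cost claim: pooling over whole axes before the excitation MLPs makes the gates far cheaper than full 3D operations while still coupling channel, temporal, and spatial statistics. The multi-expert spatial-temporal attention blocks and the temporally augmented VAE decoder would operate on the same (B, C, T, H, W) layout but are not sketched here.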

Cite

Text

Xiang et al. "VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation." International Conference on Learning Representations, 2024.

Markdown

[Xiang et al. "VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/xiang2024iclr-versvideo/)

BibTeX

@inproceedings{xiang2024iclr-versvideo,
  title     = {{VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation}},
  author    = {Xiang, Jinxi and Huang, Ricong and Zhang, Jun and Li, Guanbin and Han, Xiao and Wei, Yang},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/xiang2024iclr-versvideo/}
}