ControlVideo: Training-Free Controllable Text-to-Video Generation

Abstract

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive cost of training. To avert this training burden, we propose ControlVideo, a training-free framework that produces high-quality videos from the provided text prompts and motion sequences. Specifically, ControlVideo adapts a pre-trained text-to-image model (i.e., ControlNet) for controllable text-to-video generation. To generate continuous videos without flicker, we propose an interleaved-frame smoother that smooths the intermediate frames. In particular, the interleaved-frame smoother splits the whole video into successive three-frame clips and stabilizes each clip by updating the middle frame with the interpolation of the other two frames in latent space. Furthermore, a fully cross-frame interaction mechanism is exploited to further enhance frame consistency, while a hierarchical sampler is employed to produce long videos efficiently. Extensive experiments demonstrate that our ControlVideo outperforms the state of the art both quantitatively and qualitatively. It is worth noting that, thanks to these efficient designs, ControlVideo can generate both short and long videos within several minutes on a single NVIDIA 2080Ti. Code and videos are available at [this link](https://github.com/YBYBZhang/ControlVideo).
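The following is a minimal sketch of the interleaved-frame-smoothing idea described in the abstract, not the paper's actual implementation. It assumes the per-frame latents at a given denoising step are stacked into a `(num_frames, C, H, W)` tensor; the function name `interleaved_frame_smoother`, the `timestep_parity` argument, and the use of simple averaging as the "interpolation" are illustrative assumptions.

```python
import torch


def interleaved_frame_smoother(latents: torch.Tensor, timestep_parity: int) -> torch.Tensor:
    """Sketch: smooth a video's latents by viewing it as successive three-frame clips.

    The middle frame of each clip is replaced by the interpolation (here, the
    average) of its two neighboring frames in latent space. `timestep_parity`
    shifts which frames act as "middle" frames, so that alternating denoising
    steps smooth interleaved sets of frames.
    """
    smoothed = latents.clone()
    num_frames = latents.shape[0]
    # Middle frames start at index 1 or 2 depending on parity, then every other frame.
    start = 1 + (timestep_parity % 2)
    for mid in range(start, num_frames - 1, 2):
        smoothed[mid] = 0.5 * (latents[mid - 1] + latents[mid + 1])
    return smoothed
```

In practice such a smoother would be applied at selected denoising timesteps, alternating the parity so every intermediate frame is eventually stabilized against its neighbors.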

Cite

Text

Zhang et al. "ControlVideo: Training-Free Controllable Text-to-Video Generation." International Conference on Learning Representations, 2024.

Markdown

[Zhang et al. "ControlVideo: Training-Free Controllable Text-to-Video Generation." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/zhang2024iclr-controlvideo/)

BibTeX

@inproceedings{zhang2024iclr-controlvideo,
  title     = {{ControlVideo: Training-Free Controllable Text-to-Video Generation}},
  author    = {Zhang, Yabo and Wei, Yuxiang and Jiang, Dongsheng and Zhang, Xiaopeng and Zuo, Wangmeng and Tian, Qi},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/zhang2024iclr-controlvideo/}
}