TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Abstract

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks, such as video infilling and prediction, when provided with more images. Its autoregressive design also supports long video generation.
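
The abstract describes the sampling loop only in words, so a rough sketch may help make the control flow concrete. The Python below is an illustrative reading of "repeat-and-slide" with a noise initialization derived from the known frames and a simple resampling loop; it is not the authors' implementation. Everything named here is an assumption introduced for illustration: `toy_eps_model` is a hypothetical stand-in for the frozen T2V noise predictor, the linear beta schedule, the queue length `K`, the step count `T`, and the resampling count `U` are arbitrary, and the sketch assumes the denoiser sees `K` known frames plus one new frame at a time.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def q_sample(x0, t, alpha_bars, rng):
    """Forward-diffuse clean frames x0 directly to noise level t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def toy_eps_model(z_t, t, text):
    """Hypothetical placeholder for the frozen, text-conditioned T2V denoiser."""
    return 0.1 * z_t

def p_sample(z_t, t, eps, betas, alphas, alpha_bars, rng):
    """One standard DDPM reverse step given a noise prediction eps."""
    mean = (z_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)

def ti2v_zero_sketch(image, text, num_new_frames=3, K=4, T=10, U=2, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)

    # "Repeat": fill a queue of K known frames with copies of the input image.
    queue = np.repeat(image[None], K, axis=0)
    video = [image]

    for _ in range(num_new_frames):
        # Initialize noise by forward-diffusing the known frames (plus a copy
        # of the latest frame for the new slot) instead of using pure noise.
        z_t = q_sample(np.concatenate([queue, queue[-1:]]), T - 1, alpha_bars, rng)

        for t in range(T - 1, -1, -1):
            for u in range(U):
                # Overwrite the K known slots with the queue noised to level t,
                # so only the last slot is actually being generated.
                z_t[:K] = q_sample(queue, t, alpha_bars, rng)
                eps = toy_eps_model(z_t, t, text)
                z_prev = p_sample(z_t, t, eps, betas, alphas, alpha_bars, rng)
                if u < U - 1:
                    # Resampling: jump back one noise level and denoise again.
                    z_t = (np.sqrt(alphas[t]) * z_prev
                           + np.sqrt(betas[t]) * rng.standard_normal(z_prev.shape))
            z_t = z_prev

        # "Slide": keep the newly generated frame and shift the queue by one.
        new_frame = z_t[-1]
        video.append(new_frame)
        queue = np.concatenate([queue[1:], new_frame[None]])

    return np.stack(video)

if __name__ == "__main__":
    image = np.random.default_rng(1).standard_normal((8, 8, 3))
    video = ti2v_zero_sketch(image, "a woman is drinking water.")
    print(video.shape)
```

Because each new frame is appended to the queue and conditions the next one, the loop is autoregressive and can in principle run for any number of frames. Starting the queue from several given frames rather than copies of one image is how a sketch like this would extend toward video prediction.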

Cite

Text

Ni et al. "TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00861

Markdown

[Ni et al. "TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ni2024cvpr-ti2vzero/) doi:10.1109/CVPR52733.2024.00861

BibTeX

@inproceedings{ni2024cvpr-ti2vzero,
  title     = {{TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models}},
  author    = {Ni, Haomiao and Egger, Bernhard and Lohit, Suhas and Cherian, Anoop and Wang, Ye and Koike-Akino, Toshiaki and Huang, Sharon X. and Marks, Tim K.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {9015--9025},
  doi       = {10.1109/CVPR52733.2024.00861},
  url       = {https://mlanthology.org/cvpr/2024/ni2024cvpr-ti2vzero/}
}