Storytelling Video Generation with Retrieval Augmentation and Character Consistency

Abstract

Despite rapid recent advances in text-to-video (T2V) generation, creating storytelling videos from text remains an important, challenging, and underexplored task. In this work, we introduce a storytelling video generation system that produces coherent videos with a storyline using only text prompts as input. The generated character keeps the same appearance across clips, which distinguishes our work from other T2V models whose character appearance varies from clip to clip. Our key novelties are twofold: a retrieval-augmented T2V generation system (RAG-T2V) and a cross-clip character consistency mechanism. RAG-T2V consists of two functional components: (i) motion structure retrieval, which searches for videos with the desired content and actions using query texts, and (ii) a structure-guided text-to-video generation model, which generates plot-aligned videos from text prompts and motion structure guidance. The character consistency mechanism is designed as a time-aware textual inversion process and can be learned without video character data (i.e., using only character images). Experimental results validate the storytelling video generation quality, character consistency, and semantic alignment of our proposed system, which shows significant advantages over various baselines.
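To make the retrieval step in component (i) concrete, the following is a minimal sketch of text-based video retrieval, not the authors' implementation. The abstract does not specify the text encoder, similarity measure, or video library, so everything here is an assumption for illustration: the bag-of-words `embed` function stands in for a learned text encoder, cosine similarity is one plausible ranking score, and `LIBRARY`, its clip ids, and its captions are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumed stand-in for a learned text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, library, k=1):
    """Return the ids of the k clips whose captions best match the query text."""
    q = embed(query)
    ranked = sorted(
        library,
        key=lambda clip_id: cosine(q, embed(library[clip_id])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical caption-indexed video library (not from the paper).
LIBRARY = {
    "clip_001": "a bear walking in the forest",
    "clip_002": "a man riding a bicycle on a road",
    "clip_003": "a dog running on the beach",
}

# Query text describing the desired content and action for one story clip.
top = retrieve("a teddy bear walking through a forest", LIBRARY, k=1)
print(top)
```

In the system described above, the motion structure of the retrieved clip would then serve as guidance for the structure-guided T2V model in component (ii); that generation step is outside the scope of this sketch.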

Cite

Text

He et al. "Storytelling Video Generation with Retrieval Augmentation and Character Consistency." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-92808-6_14

Markdown

[He et al. "Storytelling Video Generation with Retrieval Augmentation and Character Consistency." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/he2024eccvw-storytelling/) doi:10.1007/978-3-031-92808-6_14

BibTeX

@inproceedings{he2024eccvw-storytelling,
  title     = {{Storytelling Video Generation with Retrieval Augmentation and Character Consistency}},
  author    = {He, Yingqing and Xia, Menghan and Chen, Haoxin and Cun, Xiaodong and Gong, Yuan and Xing, Jinbo and Zhang, Yong and Wang, Xintao and Weng, Chao and Shan, Ying and Chen, Qifeng},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {218--234},
  doi       = {10.1007/978-3-031-92808-6_14},
  url       = {https://mlanthology.org/eccvw/2024/he2024eccvw-storytelling/}
}