Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation

Abstract

In the paradigm of AI-generated content (AIGC), there has been increasing attention to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks face challenges in maintaining consistent narratives and handling shifts in scene composition or object placement from a single abstract user prompt. Exploring the ability of large language models (LLMs) to generate time-dependent, frame-by-frame prompts, this paper introduces a new framework, dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent values from being mapped to mismatched objects, we equip a diffusion model with a novel value mapping method and dual-softmax filtering.
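
The value mapping and dual-softmax filtering mentioned above operate on attention features inside the diffusion model; the paper's exact formulation is not reproduced here. As a rough, hypothetical illustration only, the sketch below shows the generic dual-softmax matching idea: a correspondence between a token in an anchor frame and a token in a later frame is kept only when the softmaxes in both matching directions agree, and values are carried across frames only through those confident correspondences. All tensor names, the temperature, and the threshold are placeholders, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def dual_softmax_filter(anchor_feats, frame_feats, tau=0.1, threshold=0.2):
        """Match tokens of a later frame to an anchor frame via dual softmax.

        anchor_feats: (N, C) token features of the anchor frame
        frame_feats:  (M, C) token features of the current frame
        Returns, for each current-frame token, the index of its best anchor
        token and a boolean mask of confident (mutually agreeing) matches.
        """
        # Cosine-similarity matrix between the two frames' tokens.
        a = F.normalize(anchor_feats, dim=-1)
        f = F.normalize(frame_feats, dim=-1)
        sim = f @ a.T / tau                          # (M, N)

        # Dual softmax: confidence is high only if both directions agree.
        conf = F.softmax(sim, dim=1) * F.softmax(sim, dim=0)

        best_conf, best_idx = conf.max(dim=1)        # best anchor token per frame token
        keep = best_conf > threshold                 # drop low-confidence correspondences
        return best_idx, keep

    # Hypothetical usage: map value vectors from the anchor frame to a later
    # frame only where the correspondence is confident; elsewhere, keep the
    # frame's own values so nothing is mapped onto a mismatched object.
    anchor_feats, frame_feats = torch.randn(64, 128), torch.randn(64, 128)
    anchor_values, frame_values = torch.randn(64, 128), torch.randn(64, 128)
    idx, keep = dual_softmax_filter(anchor_feats, frame_feats)
    mapped_values = torch.where(keep[:, None], anchor_values[idx], frame_values)
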

Cite

Text

Hong et al. "Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation." ICML 2024 Workshops: CVG, 2024.

Markdown

[Hong et al. "Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation." ICML 2024 Workshops: CVG, 2024.](https://mlanthology.org/icmlw/2024/hong2024icmlw-large/)

BibTeX

@inproceedings{hong2024icmlw-large,
  title     = {{Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation}},
  author    = {Hong, Susung and Seo, Junyoung and Shin, Heeseong and Hong, Sunghwan and Kim, Seungryong},
  booktitle = {ICML 2024 Workshops: CVG},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/hong2024icmlw-large/}
}