Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation
Abstract
In the paradigm of AI-generated content (AIGC), there has been increasing attention to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks face challenges in maintaining consistent narratives and handling shifts in scene composition or object placement from a single abstract user prompt. Exploring the ability of large language models (LLMs) to generate time-dependent, frame-by-frame prompts, this paper introduces a new framework, dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent values from being mapped to a different object, we equip a diffusion model with a novel value mapping method and dual-softmax filtering.
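As a rough illustration of the filtering idea described in the abstract, the sketch below applies a dual softmax to the similarity between current-frame queries and anchor-frame keys, suppressing low-confidence correspondences before anchor-frame values are carried over. This is a minimal sketch under assumed names (`dual_softmax_filter`, `tau`, the toy tensor shapes), not the authors' implementation of rotational value mapping.

```python
# Minimal sketch (not the paper's code) of dual-softmax filtering for
# cross-frame attention: values from an anchor frame are only mapped through
# correspondences that both frames agree on with high confidence.
import torch
import torch.nn.functional as F

def dual_softmax_filter(frame_queries, anchor_keys, tau=0.2):
    """Return a confidence-weighted correspondence matrix.

    frame_queries: (N, d) features of the current frame
    anchor_keys:   (M, d) features of the anchor frame
    """
    sim = frame_queries @ anchor_keys.T / frame_queries.shape[-1] ** 0.5  # (N, M)
    # Dual softmax: agreement of row-wise and column-wise matching distributions.
    conf = F.softmax(sim, dim=-1) * F.softmax(sim, dim=0)
    # Drop low-confidence matches so a value is not mapped to a different object.
    return torch.where(conf > tau, conf, torch.zeros_like(conf))

# Toy usage: carry anchor-frame values over through confident correspondences only.
q = torch.randn(64, 32)   # current-frame queries
k = torch.randn(64, 32)   # anchor-frame keys
v = torch.randn(64, 32)   # anchor-frame values
w = dual_softmax_filter(q, k)
mapped_values = w @ v     # (64, 32) filtered value mapping from the anchor frame
```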
Cite
Text
Hong et al. "Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation." ICML 2024 Workshops: CVG, 2024.
Markdown
[Hong et al. "Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation." ICML 2024 Workshops: CVG, 2024.](https://mlanthology.org/icmlw/2024/hong2024icmlw-large/)
BibTeX
@inproceedings{hong2024icmlw-large,
title = {{Large Language Models Are Frame-Level Directors for Zero-Shot Text-to-Video Generation}},
author = {Hong, Susung and Seo, Junyoung and Shin, Heeseong and Hong, Sunghwan and Kim, Seungryong},
booktitle = {ICML 2024 Workshops: CVG},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/hong2024icmlw-large/}
}