REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Abstract

Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips from the input into their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to fill each quote placeholder with the video clip that best supports the narrative, selected from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
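The two-stage idea described above (generate a script with quote placeholders, then fill each placeholder by retrieval) can be illustrated with a toy sketch. This is not the authors' implementation: the `[QUOTE]` marker, the `fill_quotes` function, the clip dictionaries, and the bag-of-words cosine similarity are all assumptions made for illustration. The paper uses a finetuned large language model for the script and a learned multimodal retrieval model for selection, and the script here is simply hard-coded.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder marker; the paper's actual format is not stated in the abstract.
PLACEHOLDER = "[QUOTE]"


def bag_of_words(text):
    """Lowercased token counts: a toy stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fill_quotes(script_lines, candidates):
    """Replace each placeholder with the unused clip closest to the adjacent narration."""
    used = set()
    output = []
    for i, line in enumerate(script_lines):
        if line != PLACEHOLDER:
            output.append(("narration", line))
            continue
        # Context: the narration lines directly before and after the placeholder.
        context = " ".join(
            script_lines[j]
            for j in (i - 1, i + 1)
            if 0 <= j < len(script_lines) and script_lines[j] != PLACEHOLDER
        )
        ctx_vec = bag_of_words(context)
        best = max(
            (c for c in candidates if c["id"] not in used),
            key=lambda c: cosine(ctx_vec, bag_of_words(c["transcript"])),
            default=None,
        )
        if best is None:
            continue  # no clips left; drop the placeholder
        used.add(best["id"])
        output.append(("quote", best["id"]))
    return output


# Made-up script and clip pool, standing in for stage-one LLM output
# and for quotable clips extracted from the long input video.
script = [
    "The coral reefs are dying faster than anyone predicted.",
    PLACEHOLDER,
    "But a small team of divers refuses to give up.",
]
candidates = [
    {"id": "clip_a", "transcript": "We saw the reefs turn white, the coral was dying in front of us."},
    {"id": "clip_b", "transcript": "The budget meeting ran long that year."},
]
result = fill_quotes(script, candidates)
print(result)
```

In this sketch the narration stays exactly as generated and only the placeholder slots are filled, which mirrors the separation between narrative generation and clip selection that the abstract describes.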

Cite

Text

Xu et al. "REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing." Advances in Neural Information Processing Systems, 2025.

Markdown

[Xu et al. "REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/xu2025neurips-regen/)

BibTeX

@inproceedings{xu2025neurips-regen,
  title     = {{REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing}},
  author    = {Xu, Weihan and Ma, Yimeng and Huang, Jingyue and Li, Yang and Ma, Wenye and Berg-Kirkpatrick, Taylor and McAuley, Julian and Liang, Paul Pu and Dong, Hao-Wen},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/xu2025neurips-regen/}
}