VidEvo: Evolving Video Editing Through Exhaustive Temporal Modeling
Abstract
Text-guided video editing (TGVE) has recently become a hotspot owing to its entertainment value and practical applications. To reduce overhead, existing methods primarily extend text-to-image diffusion models and typically involve a reconstruction phase and an editing phase. However, challenges persist, particularly in improving a video's temporal consistency while preserving alignment with the text prompt. A crucial cause is that existing methods tune the attention module, which is specifically designed to capture temporal information, only inadequately and implicitly. In light of this, we introduce VidEvo, a novel one-shot video editing method that leverages explicit cues derived from the original video to enhance temporal modeling. By integrating null-video embedding (NVE) and window-frame attention (WFA) components, VidEvo generates smooth and coherent videos from global and local perspectives simultaneously. Specifically, NVE learns a set of multi-scale temporal embeddings within the visual space during the reconstruction phase. These embeddings are then injected directly into the attention module in the editing phase, explicitly improving the temporal consistency of the entire video. WFA, in turn, enhances local temporal modeling by dynamically optimizing attention between adjacent frames, which improves temporal coherence at reduced computational cost. Experimental evaluations show that VidEvo enhances frame-to-frame temporal consistency. Ablation studies confirm the effectiveness of NVE and WFA and their plug-and-play capability with other methods.
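The abstract does not specify how WFA is implemented, so the following is only a minimal sketch of the general idea of restricting attention to adjacent frames. It is not the authors' method: the function name `window_frame_attention`, the `window` parameter, and the use of raw frame features as queries, keys, and values are all assumptions made for illustration, and no learned projections or dynamic optimization are modeled.

```python
import math


def window_frame_attention(frames, window=1):
    """Illustrative windowed attention over per-frame feature vectors.

    Hypothetical sketch: frame i attends only to frames j with
    |i - j| <= window, and the raw features serve as queries, keys,
    and values at once.
    """
    n = len(frames)
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbors = range(lo, hi)
        # Scaled dot-product scores against frames inside the window only.
        scores = [
            scale * sum(q * k for q, k in zip(frames[i], frames[j]))
            for j in neighbors
        ]
        # Numerically stable softmax over the window.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the neighboring frame features.
        mixed = [
            sum(w * frames[j][d] for w, j in zip(weights, neighbors))
            for d in range(dim)
        ]
        output.append(mixed)
    return output


# Example call on three toy 2-dimensional frame features.
smoothed = window_frame_attention(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], window=1
)
```

In this sketch each frame computes at most `2 * window + 1` attention scores, so the cost grows linearly with the number of frames instead of quadratically, which is one plausible reading of the reduced computational cost mentioned in the abstract.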
Cite
Text
Dang et al. "VidEvo: Evolving Video Editing Through Exhaustive Temporal Modeling." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/99
Markdown
[Dang et al. "VidEvo: Evolving Video Editing Through Exhaustive Temporal Modeling." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/dang2025ijcai-videvo/) doi:10.24963/IJCAI.2025/99
BibTeX
@inproceedings{dang2025ijcai-videvo,
title = {{VidEvo: Evolving Video Editing Through Exhaustive Temporal Modeling}},
author = {Dang, Sizhe and Liu, Huan and Wang, Mengmeng and Lai, Xin and Dai, Guang and Wang, Jingdong},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {882-890},
doi = {10.24963/IJCAI.2025/99},
url = {https://mlanthology.org/ijcai/2025/dang2025ijcai-videvo/}
}