ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
Abstract
Recent advances in video synthesis have attracted significant attention, and video synthesis models have demonstrated the practical applicability of diffusion models in creating dynamic visual content. Despite these advances, extending video length remains constrained by computational resources, and most existing video synthesis models are limited to generating short clips. In this paper, we propose ExVideo, a novel post-tuning methodology for video synthesis models. The approach enables current video synthesis models to produce content over longer temporal durations at a lower training cost. In particular, we design extension strategies for each of the common temporal model architectures: 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of the proposed post-tuning approach, we trained ExSVD, an extended model based on the Stable Video Diffusion model. Our approach enables the model to generate up to 5× its original number of frames, requiring only 1.5k GPU hours of training on a dataset of 40k videos. Importantly, the substantial increase in video length does not compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We have released the source code and the enhanced model publicly.
Cite
Text
Duan et al. "ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1118
Markdown
[Duan et al. "ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/duan2025ijcai-exvideo/) doi:10.24963/IJCAI.2025/1118
BibTeX
@inproceedings{duan2025ijcai-exvideo,
title = {{ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning}},
author = {Duan, Zhongjie and Zhang, Hong and Zhou, Wenmeng and Chen, Cen and Li, Yaliang and Zhang, Yu and Chen, Yingda},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {10063--10071},
doi = {10.24963/IJCAI.2025/1118},
url = {https://mlanthology.org/ijcai/2025/duan2025ijcai-exvideo/}
}