Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
Abstract
In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models, enhancing them for video editing applications. Our approach centers on the concept of anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames, ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses limitations of previous models, including memory and processing speed, but also improves temporal consistency through a unique data augmentation strategy. This strategy renders the model equivariant to affine transformations in both source and target images. Remarkably efficient, Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, outpacing prior works by at least 44x. A comprehensive user study, involving 1000 generated samples, confirms that our approach delivers superior quality, decisively outperforming established methods.
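The abstract names anchor-based cross-frame attention but does not spell out how it is computed. As a rough illustration only, the sketch below assumes one common reading: every frame's query tokens attend to keys and values pooled from a small set of anchor frames, so all frames draw on the same shared features. The function names, argument names, and toy vectors are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_cross_frame_attention(frame_queries, anchor_keys, anchor_values):
    """Scaled dot-product attention from one frame to anchor-frame tokens.

    ASSUMPTION (not confirmed by the abstract): keys and values come only
    from the anchor frames, and the same anchor tokens are reused for
    every frame being edited.

    frame_queries: list of query vectors, one per token of the current frame.
    anchor_keys / anchor_values: token vectors pooled from all anchor frames.
    """
    scale = 1.0 / math.sqrt(len(frame_queries[0]))
    value_dim = len(anchor_values[0])
    outputs = []
    for q in frame_queries:
        # Similarity of this query to every anchor token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in anchor_keys]
        weights = softmax(scores)
        # Weighted average of anchor values: the features shared across frames.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, anchor_values)) for j in range(value_dim)]
        )
    return outputs


# Toy usage: two frames attend to the same anchor tokens.
anchor_keys = [[1.0, 0.0], [0.0, 1.0]]
anchor_values = [[10.0, 0.0], [0.0, 10.0]]
frame_a = anchor_cross_frame_attention([[1.0, 0.0]], anchor_keys, anchor_values)
frame_b = anchor_cross_frame_attention([[1.0, 0.0]], anchor_keys, anchor_values)
```

Because both frames read from the same anchor keys and values, identical queries yield identical outputs. In this simplified picture, that sharing is what would keep frames consistent with one another, and since each frame depends only on the anchors, the frames could be processed independently of each other.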
Cite
Text
Wu et al. "Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00789
Markdown
[Wu et al. "Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wu2024cvpr-fairy/) doi:10.1109/CVPR52733.2024.00789
BibTeX
@inproceedings{wu2024cvpr-fairy,
title = {{Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis}},
author = {Wu, Bichen and Chuang, Ching-Yao and Wang, Xiaoyan and Jia, Yichen and Krishnakumar, Kapil and Xiao, Tong and Liang, Feng and Yu, Licheng and Vajda, Peter},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {8261-8270},
doi = {10.1109/CVPR52733.2024.00789},
url = {https://mlanthology.org/cvpr/2024/wu2024cvpr-fairy/}
}