Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

Abstract

In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models, enhancing them for video editing applications. Our approach centers on the concept of anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames, ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses limitations of previous models, including memory and processing speed, but also improves temporal consistency through a unique data augmentation strategy. This strategy renders the model equivariant to affine transformations in both source and target images. Remarkably efficient, Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, outpacing prior works by at least 44x. A comprehensive user study, involving 1000 generated samples, confirms that our approach delivers superior quality, decisively outperforming established methods.
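
To make the idea of anchor-based cross-frame attention more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each frame is a list of token feature vectors, that a few frames are chosen as anchors, and that every frame's tokens act as queries attending to the pooled anchor tokens as keys and values. The learned query/key/value projections of a real diffusion model are replaced by the identity, and the names `anchor_cross_frame_attention` and `edit_video` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def anchor_cross_frame_attention(frame_tokens, anchor_tokens):
    """Attend from one frame's tokens (queries) to the anchor tokens.

    Assumption: identity projections stand in for the learned Q/K/V
    weights, so the anchor tokens serve as both keys and values.
    """
    d = len(frame_tokens[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product attention
    out = []
    for q in frame_tokens:
        weights = softmax([dot(q, k) * scale for k in anchor_tokens])
        # Each output token is a weighted average of the anchor tokens.
        out.append([
            sum(w * v[i] for w, v in zip(weights, anchor_tokens))
            for i in range(d)
        ])
    return out


def edit_video(frames, anchor_indices):
    """Apply anchor attention to every frame (hypothetical helper).

    The anchor tokens are gathered once and shared by all frames; each
    frame depends only on them, not on the other frames' outputs.
    """
    anchor_tokens = [tok for i in anchor_indices for tok in frames[i]]
    return [anchor_cross_frame_attention(f, anchor_tokens) for f in frames]


# Toy input: 3 frames, 2 tokens per frame, 2-dimensional features.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [2.0, -1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
out = edit_video(frames, anchor_indices=[0])
```

In this sketch every output token is a convex combination of the same anchor features, which is one way features could be shared across frames. Because each frame depends only on the shared anchor tokens, the per-frame calls are independent of one another and could in principle run in parallel.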

Cite

Text

Wu et al. "Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00789

Markdown

[Wu et al. "Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wu2024cvpr-fairy/) doi:10.1109/CVPR52733.2024.00789

BibTeX

@inproceedings{wu2024cvpr-fairy,
  title     = {{Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis}},
  author    = {Wu, Bichen and Chuang, Ching-Yao and Wang, Xiaoyan and Jia, Yichen and Krishnakumar, Kapil and Xiao, Tong and Liang, Feng and Yu, Licheng and Vajda, Peter},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {8261-8270},
  doi       = {10.1109/CVPR52733.2024.00789},
  url       = {https://mlanthology.org/cvpr/2024/wu2024cvpr-fairy/}
}