Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models

Saman Motamed, Wouter Van Gansbeke, Luc Van Gool

CVPRW 2024 pp. 7406-7415

doi:10.1109/CVPRW63382.2024.00736

Abstract

With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos. Code: https://github.com/sam-motamed/Video-Editing-X-Attention.git

Cite

Text

Motamed et al. "Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00736

Markdown

[Motamed et al. "Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/motamed2024cvprw-investigating/) doi:10.1109/CVPRW63382.2024.00736

BibTeX

@inproceedings{motamed2024cvprw-investigating,
  title     = {{Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models}},
  author    = {Motamed, Saman and Van Gansbeke, Wouter and Van Gool, Luc},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7406-7415},
  doi       = {10.1109/CVPRW63382.2024.00736},
  url       = {https://mlanthology.org/cvprw/2024/motamed2024cvprw-investigating/}
}