Text2LIVE: Text-Driven Layered Image and Video Editing

Abstract

We present a method for zero-shot, text-driven editing of natural images and videos. Given an image or a video and a target text prompt, our goal is to edit the appearance of existing objects (e.g., their texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantically meaningful manner. We train a generator on an internal dataset extracted from a single input, while leveraging an external pretrained CLIP model to establish our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color + opacity) that is composited over the input. This allows us to constrain the generation process and maintain high fidelity to the input via novel text-driven losses applied directly to the edit layer. Our method neither relies on a pretrained generator nor requires user-provided edit masks. We demonstrate localized, semantic edits on high-resolution images and videos across a variety of objects and scenes.
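To make the layered-editing idea in the abstract concrete, the following is a minimal sketch (not the authors' code) of generating an edit layer and alpha-compositing it over the input. The toy generator, tensor shapes, and function names are illustrative assumptions; the paper's CLIP-based text-driven losses are omitted.

# Minimal sketch of the edit-layer idea: instead of predicting the edited image
# directly, a generator predicts an RGBA edit layer (color + opacity) that is
# composited over the original input. Everything below is a hypothetical
# stand-in, not the Text2LIVE implementation.
import torch
import torch.nn as nn

class EditLayerGenerator(nn.Module):
    """Toy CNN mapping an RGB image to an RGBA edit layer (illustrative only)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 4, 3, padding=1),  # 3 color channels + 1 opacity channel
        )

    def forward(self, image):
        rgba = self.net(image)
        color = torch.sigmoid(rgba[:, :3])    # edit-layer color in [0, 1]
        alpha = torch.sigmoid(rgba[:, 3:4])   # edit-layer opacity in [0, 1]
        return color, alpha

def composite(image, color, alpha):
    # Standard alpha compositing: low opacity keeps the original pixels,
    # which is what allows the output to stay faithful to the input.
    return alpha * color + (1.0 - alpha) * image

if __name__ == "__main__":
    image = torch.rand(1, 3, 256, 256)   # stand-in for an input image/frame
    generator = EditLayerGenerator()
    color, alpha = generator(image)
    edited = composite(image, color, alpha)
    print(edited.shape)                  # torch.Size([1, 3, 256, 256])
    # In the paper, text-driven (CLIP-based) losses are applied both to the
    # composite and directly to the edit layer; they are not shown here.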

Cite

Text

Bar-Tal et al. "Text2LIVE: Text-Driven Layered Image and Video Editing." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19784-0_41

Markdown

[Bar-Tal et al. "Text2LIVE: Text-Driven Layered Image and Video Editing." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/bartal2022eccv-text2live/) doi:10.1007/978-3-031-19784-0_41

BibTeX

@inproceedings{bartal2022eccv-text2live,
  title     = {{Text2LIVE: Text-Driven Layered Image and Video Editing}},
  author    = {Bar-Tal, Omer and Ofri-Amar, Dolev and Fridman, Rafail and Kasten, Yoni and Dekel, Tali},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19784-0_41},
  url       = {https://mlanthology.org/eccv/2022/bartal2022eccv-text2live/}
}