VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
Abstract
Image generation and manipulation tools require technical expertise to use, inhibiting adoption. Current methods rely heavily on training for a specific domain (e.g., only faces), manual work or algorithmic tuning for latent vector discovery, and manual effort in mask selection to alter only part of an image. We address all of these usability constraints while producing images of high visual and semantic quality through a unique combination of OpenAI’s CLIP (Radford et al., 2021), VQGAN (Esser et al., 2021), and a generation augmentation strategy to produce VQGAN-CLIP. This allows generation and manipulation of images using natural language text, without further training on any domain datasets. We demonstrate on a variety of tasks that VQGAN-CLIP produces outputs of higher visual quality than prior, less flexible approaches such as minDALL-E (Kakaobrain, 2021) and Open-Edit (Liu et al., 2020), despite not being trained for the tasks presented.
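The core idea described in the abstract can be illustrated with a short sketch: optimize a VQGAN latent so that CLIP embeddings of many augmented crops of the decoded image move toward the CLIP embedding of the text prompt. The code below is a minimal, hedged sketch, assuming OpenAI's `clip` package and a hypothetical `vqgan` wrapper that exposes a differentiable `decode(z)` method (a stand-in for the taming-transformers decoder); the latent shape, learning rate, cutout count, and loss are illustrative choices, not the authors' exact implementation.

```python
# Sketch of CLIP-guided VQGAN optimization. `vqgan` is an assumed wrapper with a
# differentiable decode(z) -> RGB image in [0, 1]; everything else uses standard
# PyTorch and OpenAI's `clip` package.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float().eval()  # keep everything in fp32 for simplicity

prompt = "a watercolor painting of a lighthouse at sunset"
with torch.no_grad():
    text_emb = clip_model.encode_text(clip.tokenize([prompt]).to(device))
    text_emb = F.normalize(text_emb, dim=-1)

# z: the VQGAN latent being optimized (shape depends on the chosen VQGAN).
z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

NUM_CUTOUTS = 32  # augmentation strategy: score many random crops per step

def make_cutouts(img, n, size=224):
    """Take n random square crops of the image and resize them for CLIP."""
    _, _, h, w = img.shape
    cutouts = []
    for _ in range(n):
        side = int(torch.randint(min(h, w) // 2, min(h, w) + 1, ()).item())
        y = int(torch.randint(0, h - side + 1, ()).item())
        x = int(torch.randint(0, w - side + 1, ()).item())
        crop = img[:, :, y:y + side, x:x + side]
        cutouts.append(F.interpolate(crop, size=(size, size), mode="bilinear"))
    return torch.cat(cutouts, dim=0)

for step in range(500):
    image = vqgan.decode(z)                      # latent -> RGB image
    batch = make_cutouts(image, NUM_CUTOUTS)     # augmented views for CLIP
    img_emb = F.normalize(clip_model.encode_image(batch), dim=-1)
    loss = (1 - img_emb @ text_emb.T).mean()     # mean cosine distance to prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```

For image editing rather than open-ended generation, the same loop can start from the VQGAN encoding of an existing image instead of a random latent; CLIP normalization of the cutouts is omitted here for brevity.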
Cite
Text
Crowson et al. "VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19836-6
Markdown
[Crowson et al. "VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/crowson2022eccv-vqganclip/) doi:10.1007/978-3-031-19836-6
BibTeX
@inproceedings{crowson2022eccv-vqganclip,
  title     = {{VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance}},
  author    = {Crowson, Katherine and Biderman, Stella and Kornis, Daniel and Stander, Dashiell and Hallahan, Eric and Castricato, Louis and Raff, Edward},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19836-6},
  url       = {https://mlanthology.org/eccv/2022/crowson2022eccv-vqganclip/}
}