Text-to-Image Diffusion Models Are Zero-Shot Classifiers
Abstract
Text-to-image diffusion models have demonstrated remarkable generative capabilities, suggesting they learn informative representations of image-text data. However, their abilities are not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is to use a diffusion model's ability to denoise a noised image given a textual description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and to compare it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it is more robust than CLIP and can successfully perform attribute binding, which CLIP cannot. Although generative pre-training is common in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for visual and vision-language problems.
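As a rough illustration of the key idea, the sketch below scores each candidate label by how well a text-conditioned diffusion model denoises a noised image when conditioned on that label's description, then predicts the label with the lowest average denoising error. This is a minimal sketch under assumptions, not the authors' implementation: `denoise_fn`, `alpha_bars`, and the hyperparameters are hypothetical placeholders for a generic epsilon-prediction diffusion model and its noise schedule.

```python
import torch

def add_noise(x0, noise, alpha_bar_t):
    # Forward process q(x_t | x_0): blend the clean image with Gaussian noise
    # according to the cumulative noise schedule value at step t.
    return alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * noise

@torch.no_grad()
def zero_shot_classify(denoise_fn, image, class_prompts, alpha_bars, n_steps=8):
    """Pick the class whose text prompt best helps the model denoise the image.

    `denoise_fn(noisy_image, t, prompt)` is an assumed interface for a
    text-conditioned diffusion model that returns its noise prediction;
    `alpha_bars` is a 1-D tensor of cumulative schedule values.
    """
    scores = []
    for prompt in class_prompts:
        errors = []
        # Sample a spread of timesteps; more samples give a less noisy estimate.
        for t in torch.linspace(0, len(alpha_bars) - 1, n_steps).long():
            noise = torch.randn_like(image)
            noisy = add_noise(image, noise, alpha_bars[t])
            pred = denoise_fn(noisy, t, prompt)
            errors.append(((pred - noise) ** 2).mean())
        # Lower average denoising error serves as a proxy for higher label likelihood.
        scores.append(torch.stack(errors).mean())
    return int(torch.stack(scores).argmin())
```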
Cite
Text
Clark and Jaini. "Text-to-Image Diffusion Models Are Zero-Shot Classifiers." ICLR 2023 Workshops: MRL, 2023.
Markdown
[Clark and Jaini. "Text-to-Image Diffusion Models Are Zero-Shot Classifiers." ICLR 2023 Workshops: MRL, 2023.](https://mlanthology.org/iclrw/2023/clark2023iclrw-texttoimage/)
BibTeX
@inproceedings{clark2023iclrw-texttoimage,
title = {{Text-to-Image Diffusion Models Are Zero-Shot Classifiers}},
author = {Clark, Kevin and Jaini, Priyank},
booktitle = {ICLR 2023 Workshops: MRL},
year = {2023},
url = {https://mlanthology.org/iclrw/2023/clark2023iclrw-texttoimage/}
}