Discffusion: Discriminative Diffusion Models as Few-Shot Vision and Language Learners
Abstract
Diffusion models, such as Stable Diffusion, have shown impressive performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (Discffusion), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention scores of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tunes the model via a new attention-based prompt learning scheme to perform image-text matching. By comparing Discffusion with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
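To make the core idea concrete, the sketch below illustrates, in plain PyTorch, how a cross-attention map between visual features and text token embeddings can be collapsed into a single image-text matching score. It is a minimal illustration of the general mechanism described in the abstract, not the authors' implementation: the tensor shapes, projection matrices, and max-then-mean aggregation rule are all assumptions made for the example.

```python
# Minimal sketch (not the Discffusion code): scoring an image-text pair by
# aggregating cross-attention weights between visual latents and text tokens.
# Shapes, projections, and the aggregation rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def cross_attention_score(latent_feats, text_embeds, w_q, w_k):
    """Collapse a cross-attention map into one matching score.

    latent_feats: (num_patches, d_img)  stand-in for diffusion U-Net features
    text_embeds:  (num_tokens, d_txt)   stand-in for prompt token embeddings
    w_q, w_k:     projections mapping both modalities to a shared dimension
    """
    q = latent_feats @ w_q                                   # (num_patches, d)
    k = text_embeds @ w_k                                    # (num_tokens, d)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (num_patches, num_tokens)
    # For each image patch, keep its strongest text alignment, then average.
    return attn.max(dim=-1).values.mean()

# Toy usage: pick the caption whose score is highest for a given image.
torch.manual_seed(0)
d_img, d_txt, d = 320, 768, 64
w_q, w_k = torch.randn(d_img, d), torch.randn(d_txt, d)
image_latents = torch.randn(64, d_img)                       # fake visual features
captions = [torch.randn(12, d_txt) for _ in range(3)]        # fake text embeddings
scores = [cross_attention_score(image_latents, c, w_q, w_k) for c in captions]
print("best-matching caption index:", int(torch.stack(scores).argmax()))
```

In the paper's setting, the projections and attention maps come from the pre-trained Stable Diffusion U-Net rather than random matrices, and the model is adapted with attention-based prompt learning in the few-shot regime.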
Cite
Text
He et al. "Discffusion: Discriminative Diffusion Models as Few-Shot Vision and Language Learners." Transactions on Machine Learning Research, 2024.Markdown
[He et al. "Discffusion: Discriminative Diffusion Models as Few-Shot Vision and Language Learners." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/he2024tmlr-discffusion/)BibTeX
@article{he2024tmlr-discffusion,
title = {{Discffusion: Discriminative Diffusion Models as Few-Shot Vision and Language Learners}},
author = {He, Xuehai and Feng, Weixi and Fu, Tsu-Jui and Jampani, Varun and Akula, Arjun Reddy and Narayana, Pradyumna and Basu, S and Wang, William Yang and Wang, Xin Eric},
journal = {Transactions on Machine Learning Research},
year = {2024},
url = {https://mlanthology.org/tmlr/2024/he2024tmlr-discffusion/}
}