Do Text-Free Diffusion Models Learn Discriminative Visual Representations?

Abstract

Diffusion models have proven to be state-of-the-art methods for generative tasks. These models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. However, text-free diffusion models have typically not been explored for discriminative tasks. In this work, we take a pre-trained unconditional diffusion model and analyze its features post hoc. We find that the intermediate feature maps of the pre-trained U-Net are diverse and have hidden discriminative representation properties. To unleash the potential of these latent properties of diffusion models, we present novel aggregation schemes. Firstly, we propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer feature fusion of different diffusion U-Net blocks and noise steps. Next, we also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art representation learning methods for discriminative tasks – image classification with full and semi-supervision, transfer for fine-grained classification, object detection, and semantic segmentation. Our project website and code are available publicly.
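To make the described mechanisms concrete, below is a minimal, illustrative sketch (not the authors' released code) of attention pooling over intermediate U-Net feature maps and a small transformer that fuses pooled features from different blocks and noise steps before a linear classifier. All module names, dimensions, and the classifier head are assumptions for illustration; in practice the feature maps would come from forward hooks on a frozen pre-trained diffusion U-Net.

# Minimal sketch (assumptions, not the authors' implementation): attention
# pooling of (B, C, H, W) U-Net activations and transformer fusion of the
# pooled tokens from several blocks / noise steps.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pool a (B, C, H, W) feature map to a (B, dim) vector with a learned query."""
    def __init__(self, in_channels: int, dim: int, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_channels, dim)
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fmap.shape
        tokens = self.proj(fmap.flatten(2).transpose(1, 2))  # (B, H*W, dim)
        q = self.query.expand(b, -1, -1)                      # (B, 1, dim)
        pooled, _ = self.attn(q, tokens, tokens)              # (B, 1, dim)
        return pooled.squeeze(1)

class FeatureFusionHead(nn.Module):
    """Fuse pooled features from multiple U-Net blocks / noise steps, then classify."""
    def __init__(self, channels_per_block, dim=256, num_classes=1000):
        super().__init__()
        self.pools = nn.ModuleList(AttentionPool(c, dim) for c in channels_per_block)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feature_maps):
        # One pooled token per block/noise step, fused by self-attention.
        tokens = torch.stack([p(f) for p, f in zip(self.pools, feature_maps)], dim=1)
        fused = self.fusion(tokens).mean(dim=1)               # (B, dim)
        return self.classifier(fused)

# Random tensors stand in for frozen U-Net activations at two blocks.
feats = [torch.randn(2, 512, 16, 16), torch.randn(2, 256, 32, 32)]
head = FeatureFusionHead(channels_per_block=[512, 256], num_classes=10)
logits = head(feats)  # (2, 10)

The feedback mechanism described in the abstract (DifFeed) would additionally route such fused features back into the diffusion backbone; that step is not shown here.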

Cite

Text

Mukhopadhyay et al. "Do Text-Free Diffusion Models Learn Discriminative Visual Representations?" Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73027-6_15

Markdown

[Mukhopadhyay et al. "Do Text-Free Diffusion Models Learn Discriminative Visual Representations?" Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/mukhopadhyay2024eccv-textfree/) doi:10.1007/978-3-031-73027-6_15

BibTeX

@inproceedings{mukhopadhyay2024eccv-textfree,
  title     = {{Do Text-Free Diffusion Models Learn Discriminative Visual Representations?}},
  author    = {Mukhopadhyay, Soumik and Gwilliam, Matthew A and Yamaguchi, Yosuke and Agarwal, Vatsal and Padmanabhan, Namitha and Swaminathan, Archana and Zhou, Tianyi and Ohya, Jun and Shrivastava, Abhinav},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73027-6_15},
  url       = {https://mlanthology.org/eccv/2024/mukhopadhyay2024eccv-textfree/}
}