Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images

Abstract

This paper studies visual representation learning with diffusion-generated synthetic images. We first show that the cross-attention layers of diffusion models inherently provide annotation-free attention masks that align with the corresponding text inputs on the generated images. We then investigate the problems that three prevalent representation learning methods (contrastive learning, masked modeling, and vision-language pretraining) face on diffusion-generated synthetic data, and introduce customized solutions that fully exploit these free attention masks, which we call Free-ATM. Comprehensive experiments demonstrate that Free-ATM improves the performance of various representation learning frameworks trained on synthetic data. This improvement is consistent across diverse downstream tasks, including image classification, detection, segmentation, and image-text retrieval. Moreover, Free-ATM significantly accelerates pretraining on synthetic images and closes the performance gap between representation learning on synthetic data and on real-world data.
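To make the idea of a "free" attention mask concrete, the following is a minimal sketch of how a cross-attention map for one text token could be turned into a binary spatial mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `token_attention_mask`, the `(heads, pixels, tokens)` array layout, the min-max normalization, and the 0.5 threshold are all hypothetical choices, and the input is a toy array rather than attention extracted from a real diffusion model.

```python
import numpy as np


def token_attention_mask(attn, token_index, threshold=0.5):
    """Turn cross-attention weights for one text token into a binary mask.

    Assumed (hypothetical) layout: attn has shape (heads, pixels, tokens),
    where pixels = side * side is a flattened square spatial grid.
    """
    attn = np.asarray(attn, dtype=np.float64)
    heads, n_pixels, n_tokens = attn.shape
    side = int(round(np.sqrt(n_pixels)))
    if side * side != n_pixels:
        raise ValueError("expected a square spatial grid")

    # Average over heads, then select the column for the chosen token.
    token_map = attn.mean(axis=0)[:, token_index].reshape(side, side)

    # Min-max normalize to [0, 1]; a constant map carries no spatial signal.
    lo, hi = token_map.min(), token_map.max()
    if hi - lo < 1e-12:
        return np.zeros((side, side), dtype=bool)
    normalized = (token_map - lo) / (hi - lo)

    # Pixels at or above the threshold are assigned to the token's mask.
    return normalized >= threshold


# Toy example: token 1 attends strongly to the top-left 2x2 region.
toy_map = np.full((4, 4), 0.1)
toy_map[:2, :2] = 0.9
attn = np.full((2, 16, 3), 0.1)
attn[:, :, 1] = toy_map.reshape(-1)

mask = token_attention_mask(attn, token_index=1)
print(mask.astype(int))
```

In practice the attention weights would have to be collected from the cross-attention layers during image generation, and the choice of layers, resolution, and threshold would likely affect mask quality; none of those details are specified in the abstract.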

Cite

Text

Zhang et al. "Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73661-2_26

Markdown

[Zhang et al. "Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhang2024eccv-freeatm/) doi:10.1007/978-3-031-73661-2_26

BibTeX

@inproceedings{zhang2024eccv-freeatm,
  title     = {{Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images}},
  author    = {Zhang, David Junhao and Xu, Mutian and Wu, Jay Zhangjie and Xue, Chuhui and Zhang, Wenqing and Han, Xiaoguang and Bai, Song and Shou, Mike Zheng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73661-2_26},
  url       = {https://mlanthology.org/eccv/2024/zhang2024eccv-freeatm/}
}