Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
Abstract
This paper studies visual representation learning with diffusion-generated synthetic images. We start by uncovering that diffusion models’ cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent representation learning methods (i.e., contrastive learning, masked modeling, and vision-language pretraining) on diffusion-generated synthetic data and introduce customized solutions by fully exploiting the aforementioned free attention masks, namely Free-ATM. Comprehensive experiments demonstrate Free-ATM’s ability to enhance the performance of various representation learning frameworks when utilizing synthetic data. This improvement is consistent across diverse downstream tasks including image classification, detection, segmentation and image-text retrieval. Meanwhile, by utilizing Free-ATM, we can accelerate the pretraining on synthetic images significantly and close the performance gap between representation learning on synthetic data and real-world scenarios.
Cite
Text
Zhang et al. "Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73661-2_26
Markdown
[Zhang et al. "Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhang2024eccv-freeatm/) doi:10.1007/978-3-031-73661-2_26
BibTeX
@inproceedings{zhang2024eccv-freeatm,
title = {{Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images}},
author = {Zhang, David Junhao and Xu, Mutian and Wu, Jay Zhangjie and Xue, Chuhui and Zhang, Wenqing and Han, Xiaoguang and Bai, Song and Shou, Mike Zheng},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73661-2_26},
url = {https://mlanthology.org/eccv/2024/zhang2024eccv-freeatm/}
}