FreeGen: Bridging Visual-Linguistic Discrepancies Towards Diffusion-Based Pixel-Level Data Synthesis
Abstract
Text-to-image diffusion models have inspired research into text-to-data synthesis without human intervention, where spatial attentions correlated with semantic entities in text prompts are primarily interpreted as pseudo-masks. However, these vanilla attentions often deliver visual-linguistic discrepancies, in which the associations between image features and entity-level tokens are unstable and divergent, yielding inferior masks for realistic applications, especially in more practical open-vocabulary settings. To tackle this issue, we propose a novel text-guided self-driven generative paradigm, termed FreeGen, which addresses the discrepancies by recalibrating intrinsic visual-linguistic correlations and serves as a real-data-free method to automatically synthesize open-vocabulary pixel-level data for arbitrary entities. Specifically, we first learn an Attention Self-Rectification mechanism to reproject the inherent attention matrices to achieve robust semantic alignment, thereby obtaining class-discriminative masks. A Temporal Fluctuation Factor is presented to assess mask quality based on its variation over uniform sampling timesteps, enabling the selection of reliable masks. These masks are then employed as self-supervised signals to support the learning of an Entity-level Grounding Decoder in a self-training manner, thus producing open-vocabulary segmentation results. Extensive experiments show that existing segmenters trained on FreeGen data narrow the performance gap with their real-data counterparts and remarkably outperform state-of-the-art methods.
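The abstract does not give the formula for the Temporal Fluctuation Factor, so the following is only an illustrative sketch of the general idea of scoring a mask by how much it varies across sampling timesteps. It assumes binary masks stacked as a `(T, H, W)` array and uses one minus the mean IoU between consecutive timesteps as a stand-in score. The function names (`fluctuation_score`, `select_reliable`) and the threshold value are hypothetical and are not taken from the paper.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Two empty masks are treated as identical.
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def fluctuation_score(masks: np.ndarray) -> float:
    """Stand-in fluctuation measure (an assumption, not the paper's formula).

    masks: array of shape (T, H, W), one binary mask per sampled timestep.
    Returns 1 - mean IoU between consecutive timesteps, so 0.0 means the
    mask never changes and 1.0 means consecutive masks never overlap.
    """
    masks = np.asarray(masks)
    if masks.ndim != 3 or masks.shape[0] < 2:
        raise ValueError("expected masks of shape (T, H, W) with T >= 2")
    ious = [iou(masks[t], masks[t + 1]) for t in range(masks.shape[0] - 1)]
    return 1.0 - float(np.mean(ious))


def select_reliable(mask_sequences, threshold: float = 0.2):
    """Return indices of sequences whose score is at or below the threshold.

    The threshold of 0.2 is an arbitrary illustrative value.
    """
    return [
        i for i, seq in enumerate(mask_sequences)
        if fluctuation_score(seq) <= threshold
    ]
```

Under these assumptions, a mask that stays identical across timesteps scores 0.0 and is kept, while one that flips between disjoint regions scores 1.0 and is discarded.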
Cite
Text
Wang et al. "FreeGen: Bridging Visual-Linguistic Discrepancies Towards Diffusion-Based Pixel-Level Data Synthesis." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I8.32853
Markdown
[Wang et al. "FreeGen: Bridging Visual-Linguistic Discrepancies Towards Diffusion-Based Pixel-Level Data Synthesis." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/wang2025aaai-freegen/) doi:10.1609/AAAI.V39I8.32853
BibTeX
@inproceedings{wang2025aaai-freegen,
title = {{FreeGen: Bridging Visual-Linguistic Discrepancies Towards Diffusion-Based Pixel-Level Data Synthesis}},
author = {Wang, Wenzhuang and Ma, Mingcan and Chen, Yong and Xia, Changqun and Liang, Zhenbao and Li, Jia},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {7916-7924},
doi = {10.1609/AAAI.V39I8.32853},
url = {https://mlanthology.org/aaai/2025/wang2025aaai-freegen/}
}