Text-to-Image Generation via Energy-Based CLIP
Abstract
Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present CLIP-JEM, a novel approach that extends JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ a contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. CLIP-JEM not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of CLIP-JEM by enhancing CLIP-based generative frameworks and converting unconditional diffusion models into text-conditional ones. Lastly, we show that our model can serve as a more robust evaluation metric for text-to-image generation than CLIP.
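To make the abstract's joint energy concrete, here is a minimal sketch of an image-text energy based on cosine similarity in CLIP space, written with PyTorch and OpenAI's clip package. The function name joint_energy and the ViT-B/32 backbone are illustrative assumptions; the paper's actual training objective and architecture may differ.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def joint_energy(images: torch.Tensor, texts: list[str]) -> torch.Tensor:
    """Energy of image-caption pairs: negated cosine similarity in CLIP space,
    so matching (real) pairs receive low energy and mismatched pairs high energy."""
    image_feats = model.encode_image(images)                       # (B, D)
    text_feats = model.encode_text(clip.tokenize(texts).to(device))  # (B, D)
    # Normalize both embeddings so their dot product is the cosine similarity.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return -(image_feats * text_feats).sum(dim=-1)                 # (B,)
```

In an energy-based training loop, this quantity would be minimized on real image-caption pairs and maximized on model samples or mismatched pairs; the sketch above only defines the energy itself.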
Cite
Text:
Ganz and Elad. "Text-to-Image Generation via Energy-Based CLIP." Transactions on Machine Learning Research, 2025.

Markdown:
[Ganz and Elad. "Text-to-Image Generation via Energy-Based CLIP." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/ganz2025tmlr-texttoimage/)

BibTeX:
@article{ganz2025tmlr-texttoimage,
  title   = {{Text-to-Image Generation via Energy-Based CLIP}},
  author  = {Ganz, Roy and Elad, Michael},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/ganz2025tmlr-texttoimage/}
}