Learning Visual Generative Priors Without Text

Abstract

Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should instead be on texture modeling. This philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models while using only 1/10 of the text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on text-irrelevant vision tasks, such as image-to-3D and image-to-video generation.

Cite

Text

Ma et al. "Learning Visual Generative Priors Without Text." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00754

Markdown

[Ma et al. "Learning Visual Generative Priors Without Text." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/ma2025cvpr-learning/) doi:10.1109/CVPR52734.2025.00754

BibTeX

@inproceedings{ma2025cvpr-learning,
  title     = {{Learning Visual Generative Priors Without Text}},
  author    = {Ma, Shuailei and Zheng, Kecheng and Wei, Ying and Wu, Wei and Lu, Fan and Zhang, Yifei and Xie, Chen-Wei and Gong, Biao and Zhu, Jiapeng and Shen, Yujun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {8051--8061},
  doi       = {10.1109/CVPR52734.2025.00754},
  url       = {https://mlanthology.org/cvpr/2025/ma2025cvpr-learning/}
}