Large-Scale Text-to-Image Model with Inpainting Is a Zero-Shot Subject-Driven Image Generator

Abstract

Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications.
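The incomplete-diptych setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `make_incomplete_diptych` and `diptych_prompt`, the white background fill, and the prompt wording are all assumptions made for illustration. The subject mask is assumed to come from some external segmentation step, and the inpainting model itself is not shown.

```python
import numpy as np


def make_incomplete_diptych(reference, subject_mask=None, background=255):
    """Build a two-panel canvas and the matching inpainting mask.

    reference:    (H, W, 3) uint8 image of the subject.
    subject_mask: optional (H, W) bool array, True on subject pixels
                  (assumed to come from an external segmentation model).
    background:   fill value for removed background (assumption: white).
    """
    h, w, c = reference.shape
    left = reference.copy()
    if subject_mask is not None:
        # Remove the reference background to limit content leakage.
        left[~subject_mask] = background

    # Left panel holds the reference; right panel starts empty.
    canvas = np.zeros((h, 2 * w, c), dtype=reference.dtype)
    canvas[:, :w] = left

    # True marks the pixels the inpainting model should generate.
    inpaint_mask = np.zeros((h, 2 * w), dtype=bool)
    inpaint_mask[:, w:] = True
    return canvas, inpaint_mask


def diptych_prompt(subject, context):
    """Hypothetical prompt template; the exact wording is an assumption."""
    return (
        f"A diptych with two side-by-side images of the same {subject}. "
        f"On the left, a photo of {subject}. "
        f"On the right, the same {subject} {context}."
    )


# Toy usage with a synthetic 4x6 reference image.
reference = np.full((4, 6, 3), 10, dtype=np.uint8)
subject_mask = np.zeros((4, 6), dtype=bool)
subject_mask[1:3, 2:4] = True
canvas, inpaint_mask = make_incomplete_diptych(reference, subject_mask)
prompt = diptych_prompt("dog", "sitting on a beach")
```

The canvas, mask, and prompt would then be passed to a text-conditioned inpainting model, and the right half of its output cropped as the final image. The attention enhancement between panels mentioned in the abstract happens inside the model and is not covered by this sketch.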

Cite

Text

Shin et al. "Large-Scale Text-to-Image Model with Inpainting Is a Zero-Shot Subject-Driven Image Generator." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00748

Markdown

[Shin et al. "Large-Scale Text-to-Image Model with Inpainting Is a Zero-Shot Subject-Driven Image Generator." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/shin2025cvpr-largescale/) doi:10.1109/CVPR52734.2025.00748

BibTeX

@inproceedings{shin2025cvpr-largescale,
  title     = {{Large-Scale Text-to-Image Model with Inpainting Is a Zero-Shot Subject-Driven Image Generator}},
  author    = {Shin, Chaehun and Choi, Jooyoung and Kim, Heeseung and Yoon, Sungroh},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {7986--7996},
  doi       = {10.1109/CVPR52734.2025.00748},
  url       = {https://mlanthology.org/cvpr/2025/shin2025cvpr-largescale/}
}