Instruct-Imagen: Image Generation with Multi-Modal Instruction

Abstract

This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject), so that a wide variety of generation intents can be standardized in a uniform format. We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model in two stages. First, we adapt the model using retrieval-augmented training to enhance the model's ability to ground its generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available.

Cite

Text

Hu et al. "Instruct-Imagen: Image Generation with Multi-Modal Instruction." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00455

Markdown

[Hu et al. "Instruct-Imagen: Image Generation with Multi-Modal Instruction." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/hu2024cvpr-instructimagen/) doi:10.1109/CVPR52733.2024.00455

BibTeX

@inproceedings{hu2024cvpr-instructimagen,
  title     = {{Instruct-Imagen: Image Generation with Multi-Modal Instruction}},
  author    = {Hu, Hexiang and Chan, Kelvin C.K. and Su, Yu-Chuan and Chen, Wenhu and Li, Yandong and Sohn, Kihyuk and Zhao, Yang and Ben, Xue and Gong, Boqing and Cohen, William and Chang, Ming-Wei and Jia, Xuhui},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {4754--4763},
  doi       = {10.1109/CVPR52733.2024.00455},
  url       = {https://mlanthology.org/cvpr/2024/hu2024cvpr-instructimagen/}
}