CIGLI: Conditional Image Generation from Language & Image

Abstract

Multi-modal generation has been widely explored in recent years. Existing research directions typically generate text from an image or an image from text. In this paper, we propose a new task called CIGLI: Conditional Image Generation from Language and Image. Instead of generating an image from text alone, as in text-to-image generation, this task requires generating an image from both a textual description and an image prompt. We design a new dataset in which each text description conveys information about both the prompt image and the target image, so that analyzing the description alone is insufficient to generate the target image. We then propose a novel language-image fusion model that outperforms two established baseline methods under both quantitative (automatic) and qualitative (human) evaluations. The code and dataset are available at https://github.com/vincentlux/CIGLI.
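To make the task setup concrete, here is a minimal, hypothetical sketch of the CIGLI interface: a generator conditioned on a text embedding and a prompt-image embedding that produces a target image. This is not the authors' model; all class names, embedding dimensions, and the toy decoder are illustrative assumptions.

```python
# Hypothetical sketch of the CIGLI task interface (not the paper's model).
# Inputs: a text embedding and a prompt-image embedding; output: a target image.
import torch
import torch.nn as nn

class LanguageImageFusionGenerator(nn.Module):
    def __init__(self, text_dim=768, img_dim=512, latent_dim=256):
        super().__init__()
        # Fuse the text and prompt-image embeddings into one conditioning vector.
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + img_dim, latent_dim),
            nn.ReLU(),
        )
        # Toy decoder mapping the fused vector to a 64x64 RGB image.
        self.decode = nn.Sequential(
            nn.Linear(latent_dim, 3 * 64 * 64),
            nn.Tanh(),
        )

    def forward(self, text_emb, prompt_img_emb):
        z = self.fuse(torch.cat([text_emb, prompt_img_emb], dim=-1))
        return self.decode(z).view(-1, 3, 64, 64)

# Usage with dummy embeddings; a real system would use pretrained
# text and image encoders to produce these vectors.
gen = LanguageImageFusionGenerator()
target = gen(torch.randn(1, 768), torch.randn(1, 512))
print(target.shape)  # torch.Size([1, 3, 64, 64])
```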

Cite

Text

Lu et al. "CIGLI: Conditional Image Generation from Language & Image." IEEE/CVF International Conference on Computer Vision Workshops, 2021. doi:10.1109/ICCVW54120.2021.00349

Markdown

[Lu et al. "CIGLI: Conditional Image Generation from Language & Image." IEEE/CVF International Conference on Computer Vision Workshops, 2021.](https://mlanthology.org/iccvw/2021/lu2021iccvw-cigli/) doi:10.1109/ICCVW54120.2021.00349

BibTeX

@inproceedings{lu2021iccvw-cigli,
  title     = {{CIGLI: Conditional Image Generation from Language \& Image}},
  author    = {Lu, Xiaopeng and Ng, Lynnette Hui Xian and Fernandez, Jared and Zhu, Hao},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2021},
  pages     = {3127--3131},
  doi       = {10.1109/ICCVW54120.2021.00349},
  url       = {https://mlanthology.org/iccvw/2021/lu2021iccvw-cigli/}
}