Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation

Peixi Xiong, Michael A Kozuch, Nilesh Jain

ECCV 2024

doi:10.1007/978-3-031-72652-1_19

Abstract

Text-to-image generation plays a pivotal role in computer vision and natural language processing by translating textual descriptions into visual representations. However, understanding complex relations in detailed text prompts filled with rich relational content remains a significant challenge. To address this, we introduce a novel task: Logic-Rich Text-to-Image generation. Unlike conventional image generation tasks that rely on short and structurally simple natural language inputs, our task focuses on intricate text inputs abundant in relational information. To tackle these complexities, we collect the Textual-Visual Logic dataset, designed to evaluate the performance of text-to-image generation models across diverse and complex scenarios. Furthermore, we propose a baseline model as a benchmark for this task. Our model comprises three key components: a relation understanding module, a multimodality fusion module, and a negative pair discriminator. These components enhance the model’s ability to handle disturbances in informative tokens and prioritize relational elements during image generation. https://github.com/IntelLabs/Textual-Visual-Logic-Challenge

Cite

Text

Xiong et al. "Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72652-1_19

Markdown

[Xiong et al. "Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/xiong2024eccv-textualvisual/) doi:10.1007/978-3-031-72652-1_19

BibTeX

@inproceedings{xiong2024eccv-textualvisual,
  title     = {{Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation}},
  author    = {Xiong, Peixi and Kozuch, Michael A and Jain, Nilesh},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72652-1_19},
  url       = {https://mlanthology.org/eccv/2024/xiong2024eccv-textualvisual/}
}