Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation
Abstract
Text-to-image generation plays a pivotal role in computer vision and natural language processing by translating textual descriptions into visual representations. However, understanding complex relations in long, relation-rich text prompts remains a significant challenge. To address this, we introduce a novel task: Logic-Rich Text-to-Image generation. Unlike conventional image generation tasks that rely on short and structurally simple natural language inputs, our task focuses on intricate text inputs abundant in relational information. To tackle these complexities, we collect the Textual-Visual Logic dataset, designed to evaluate the performance of text-to-image generation models across diverse and complex scenarios. Furthermore, we propose a baseline model as a benchmark for this task. Our model comprises three key components: a relation understanding module, a multimodality fusion module, and a negative pair discriminator. These components enhance the model's ability to handle disturbances in informative tokens and prioritize relational elements during image generation. Code: https://github.com/IntelLabs/Textual-Visual-Logic-Challenge
Cite
Text
Xiong et al. "Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72652-1_19
Markdown
[Xiong et al. "Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/xiong2024eccv-textualvisual/) doi:10.1007/978-3-031-72652-1_19
BibTeX
@inproceedings{xiong2024eccv-textualvisual,
  title     = {{Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation}},
  author    = {Xiong, Peixi and Kozuch, Michael A. and Jain, Nilesh},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72652-1_19},
  url       = {https://mlanthology.org/eccv/2024/xiong2024eccv-textualvisual/}
}