ImageGen-CoT: Enhancing Text-to-Image In-Context Learning with Chain-of-Thought Reasoning
Abstract
In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a reasoning chain, called ImageGen-CoT, prior to image generation. To avoid generating ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs on this dataset to enhance their contextual reasoning capabilities. To further improve performance, we explore test-time scaling strategies and propose a novel hybrid scaling approach, which first generates multiple reasoning chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of the proposed method. Notably, fine-tuning with the ImageGen-CoT dataset yields a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code will be open-sourced.
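The hybrid test-time scaling described above can be sketched in a few lines. This is an illustrative outline only, not the paper's released implementation: `generate_reasoning_chain`, `sample_image`, and `score` are hypothetical stand-ins for the MLLM's reasoning head, its image decoder, and an image-quality scorer (e.g. a CLIP-style reward model).

```python
def generate_reasoning_chain(prompt: str, seed: int) -> str:
    # Stand-in for sampling one ImageGen-CoT reasoning chain from the MLLM.
    return f"chain-{seed}: plan for {prompt!r}"

def sample_image(chain: str, seed: int) -> str:
    # Stand-in for decoding one image conditioned on a reasoning chain.
    return f"image(seed={seed}) from {chain}"

def score(image: str) -> int:
    # Stand-in for an automatic image scorer; deterministic toy metric here.
    return sum(ord(ch) for ch in image)

def hybrid_scale(prompt: str, n_chains: int = 3, n_images_per_chain: int = 2) -> str:
    """Hybrid test-time scaling: sample several reasoning chains, then
    sample several images per chain, and return the best-scoring image."""
    candidates = []
    for c in range(n_chains):
        chain = generate_reasoning_chain(prompt, seed=c)
        for i in range(n_images_per_chain):
            candidates.append(sample_image(chain, seed=i))
    return max(candidates, key=score)
```

In practice the two sampling budgets (`n_chains` and `n_images_per_chain`) trade off reasoning diversity against image diversity under a fixed compute budget.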
Cite
Text
Liao et al. "ImageGen-CoT: Enhancing Text-to-Image In-Context Learning with Chain-of-Thought Reasoning." International Conference on Computer Vision, 2025.
Markdown
[Liao et al. "ImageGen-CoT: Enhancing Text-to-Image In-Context Learning with Chain-of-Thought Reasoning." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/liao2025iccv-imagegencot/)
BibTeX
@inproceedings{liao2025iccv-imagegencot,
title = {{ImageGen-CoT: Enhancing Text-to-Image In-Context Learning with Chain-of-Thought Reasoning}},
author = {Liao, Jiaqi and Yang, Zhengyuan and Li, Linjie and Li, Dianqi and Lin, Kevin and Cheng, Yu and Wang, Lijuan},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {17214-17223},
url = {https://mlanthology.org/iccv/2025/liao2025iccv-imagegencot/}
}