Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning

Lin, Wang; Hu, Wentao; Jia, Liyu; Pan, Kaihang; Majun, Zhang; Zhao, Zhou; Wu, Fei; Chen, Jingyuan; Zhang, Hanwang

Wang Lin, Wentao Hu, Liyu Jia, Kaihang Pan, Zhang Majun, Zhou Zhao, Fei Wu, Jingyuan Chen, Hanwang Zhang

NeurIPS 2025

Abstract

With the continued development of large language models and chain-of-thought reasoning, deep reasoning driven by reinforcement learning has shown remarkable promise in multi-task scenarios. However, existing unified models have yet to integrate image generation and image understanding end to end, which limits their capacity for self-reflection and for cross-modal reasoning chains. To address this, we propose Vinci, a novel framework that interleaves image generation and understanding through deep reasoning. We use a small amount of multimodal chain-of-thought (MCoT) data for cold-start and then apply reinforcement learning to guide the integration of the generation and understanding tasks. We also introduce a momentum-based reward function that dynamically adjusts the reward distribution according to historical improvements, keeping the model stable across multiple generations. Experiments show that integrating MCoT yields a +22% improvement over the base model on GenEval, enhancing both image generation quality and instruction alignment.
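
The abstract does not state the form of the momentum-based reward. Purely as an illustration, and not as the authors' formulation, the sketch below shows one way a reward could account for historical improvements: it keeps an exponential moving average of the score gain between successive generations. The function name `momentum_rewards`, the decay parameter `beta`, and the per-generation quality scores are all hypothetical.

```python
from typing import List


def momentum_rewards(scores: List[float], beta: float = 0.9) -> List[float]:
    """Hypothetical momentum-shaped rewards (not the paper's formula).

    scores[t] is an assumed quality score for the image produced at
    generation t. The reward for generation t (t >= 1) is an exponential
    moving average of the score improvements observed up to that point.
    """
    rewards: List[float] = []
    momentum = 0.0
    prev = scores[0] if scores else 0.0
    for score in scores[1:]:
        improvement = score - prev  # gain over the previous generation
        # Blend the new improvement with the decayed history of past gains.
        momentum = beta * momentum + (1.0 - beta) * improvement
        rewards.append(momentum)
        prev = score
    return rewards


if __name__ == "__main__":
    # Three refinement steps after an initial generation (made-up scores).
    print(momentum_rewards([0.5, 0.6, 0.6, 0.8], beta=0.5))
```

Under this assumed design, a single generation with no gain does not immediately zero out the reward, because earlier improvements still contribute through the moving average. That smoothing is one plausible reading of "stability across multiple generations".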

Cite

Text

Lin et al. "Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Lin et al. "Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/lin2025neurips-vinci/)

BibTeX

@inproceedings{lin2025neurips-vinci,
  title     = {{Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning}},
  author    = {Lin, Wang and Hu, Wentao and Jia, Liyu and Pan, Kaihang and Majun, Zhang and Zhao, Zhou and Wu, Fei and Chen, Jingyuan and Zhang, Hanwang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/lin2025neurips-vinci/}
}