Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning

Abstract

With the continuous development of large language models and chain-of-thought reasoning techniques, deep reasoning based on reinforcement learning has shown remarkable promise in multi-task scenarios. However, existing unified models have yet to integrate image generation and image understanding end to end, which limits their capacity for self-reflection and prevents cross-modal reasoning chains from being realized. To address this, we propose Vinci, a novel framework that interleaves image generation and understanding through deep reasoning. We use a small amount of multimodal chain-of-thought (MCoT) data for cold-start and then employ reinforcement learning to guide the integration of the generation and understanding tasks. Additionally, we introduce a momentum-based reward function that dynamically adjusts the reward distribution by taking historical improvements into account, keeping the model stable across multiple rounds of generation. Experimental results demonstrate that integrating MCoT yields a +22% improvement over the base model on GenEval, enhancing both image generation quality and instruction alignment.
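The abstract does not give the formula for the momentum-based reward, so the following is only an illustrative sketch of the general idea: a per-round quality score is adjusted by an exponential moving average of past improvements. The class name `MomentumReward`, the parameters `beta` and `weight`, and the additive form of the reward are all assumptions made for illustration, not the authors' definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MomentumReward:
    """Hypothetical momentum-shaped reward; NOT the paper's actual formula."""

    beta: float = 0.9      # assumed decay rate for the running improvement
    weight: float = 0.5    # assumed weight of the momentum term
    momentum: float = 0.0  # exponential moving average of past improvements
    prev_score: Optional[float] = None  # score of the previous round, if any

    def step(self, score: float) -> float:
        """Return the shaped reward for one generation round."""
        # The first round has no history, so its improvement is zero.
        improvement = 0.0 if self.prev_score is None else score - self.prev_score
        # Blend the new improvement into the running average.
        self.momentum = self.beta * self.momentum + (1.0 - self.beta) * improvement
        self.prev_score = score
        # Raw score plus a bonus (or penalty) for the recent trend.
        return score + self.weight * self.momentum


if __name__ == "__main__":
    reward = MomentumReward(beta=0.5, weight=1.0)
    for round_score in (0.4, 0.6, 0.6):
        print(round(reward.step(round_score), 4))
```

Under this assumed form, a round that improves on its predecessor earns a bonus that decays over later rounds, while a regression is penalized, which is one way a reward could discourage oscillation across successive generations.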

Cite

Text

Lin et al. "Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Lin et al. "Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/lin2025neurips-vinci/)

BibTeX

@inproceedings{lin2025neurips-vinci,
  title     = {{Vinci: Deep Thinking in Text-to-Image Generation Using Unified Model with Reinforcement Learning}},
  author    = {Lin, Wang and Hu, Wentao and Jia, Liyu and Pan, Kaihang and Majun, Zhang and Zhao, Zhou and Wu, Fei and Chen, Jingyuan and Zhang, Hanwang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/lin2025neurips-vinci/}
}