Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Abstract

We introduce Lumina-Image 2.0, an advanced text-to-image (T2I) model that surpasses previous state-of-the-art methods across multiple benchmarks. Lumina-Image 2.0 is characterized by two key features: (1) Unification: it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. In addition, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), which generates detailed and accurate multilingual captions for our model. This not only accelerates model convergence, but also enhances prompt adherence, multi-granularity prompt handling, and task expansion with customized prompt templates. (2) Efficiency: we develop multi-stage progressive training strategies, alongside inference-time acceleration strategies that do not compromise image quality. We evaluate our model on academic benchmarks and T2I arenas, with results confirming that it matches or exceeds existing state-of-the-art models across various metrics, highlighting the effectiveness of our methods. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
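
To illustrate what treating text and image tokens as a joint sequence can look like, the following is a minimal sketch and not the released Unified Next-DiT implementation. It assumes a single attention head, identity query/key/value projections, and toy random embeddings; the function name `joint_self_attention` and all sizes are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(text_tokens, image_tokens):
    """Single-head self-attention over the concatenated [text; image] sequence.

    Assumption for brevity: Q, K, and V are the token embeddings themselves
    (identity projections). A real model would use learned projections.
    """
    seq = text_tokens + image_tokens  # one joint sequence, text first
    dim = len(seq[0])
    scale = 1.0 / math.sqrt(dim)

    # Every token scores every other token, regardless of modality.
    weights = []
    for q in seq:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in seq]
        weights.append(softmax(scores))

    # Weighted sum of values for each position in the joint sequence.
    out = [
        [sum(w * v[d] for w, v in zip(row, seq)) for d in range(dim)]
        for row in weights
    ]
    return out, weights


random.seed(0)
dim = 4  # toy embedding width (hypothetical)
text_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
image_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

out, weights = joint_self_attention(text_tokens, image_tokens)
print(len(out), len(out[0]))  # joint sequence length, embedding width
```

Because attention runs over one concatenated sequence, each image token assigns nonzero weight to text tokens and vice versa, which is one way to read the "natural cross-modal interactions" described above without a separate cross-attention module.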

Cite

Text

Qin et al. "Lumina-Image 2.0: A Unified and Efficient Image Generative Framework." International Conference on Computer Vision, 2025.

Markdown

[Qin et al. "Lumina-Image 2.0: A Unified and Efficient Image Generative Framework." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/qin2025iccv-luminaimage/)

BibTeX

@inproceedings{qin2025iccv-luminaimage,
  title     = {{Lumina-Image 2.0: A Unified and Efficient Image Generative Framework}},
  author    = {Qin, Qi and Zhuo, Le and Xin, Yi and Du, Ruoyi and Li, Zhen and Fu, Bin and Lu, Yiting and Li, Xinyue and Liu, Dongyang and Zhu, Xiangyang and Beddow, Will and Millon, Erwann and Perez, Victor and Wang, Wenhai and Qiao, Yu and Zhang, Bo and Liu, Xiaohong and Li, Hongsheng and Xu, Chang and Gao, Peng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {20031-20042},
  url       = {https://mlanthology.org/iccv/2025/qin2025iccv-luminaimage/}
}