UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Abstract

Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images that follow user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this approach requires separate model architectures, training designs, and parameter sets for different tasks. In this paper, we introduce UniVG, a generalist diffusion model that supports a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.

Cite

Text

Fu et al. "UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing." International Conference on Computer Vision, 2025.

BibTeX

@inproceedings{fu2025iccv-univg,
  title     = {{UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing}},
  author    = {Fu, Tsu-Jui and Qian, Yusu and Chen, Chen and Hu, Wenze and Gan, Zhe and Yang, Yinfei},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17160--17170},
  url       = {https://mlanthology.org/iccv/2025/fu2025iccv-univg/}
}