Reconciling Visual Perception and Generation in Diffusion Models
Abstract
We present \textsc{GenRep}, a unified image understanding and synthesis model that performs discriminative learning and generative modeling jointly in a single training process. Using Monte Carlo approximation, \textsc{GenRep} distills the distributional knowledge embedded in diffusion models to guide discriminative learning for visual perception tasks. At the same time, it establishes a semantic-driven image generation process in which the high-level semantics learned from perception tasks inform image synthesis, creating a positive feedback loop in which each task improves the other. To reconcile the learning of the two tasks, we further propose a gradient alignment strategy that symmetrically modifies the optimization directions of the perception and generation losses. Together, these designs make \textsc{GenRep} a versatile model that achieves leading performance on both image understanding and image generation benchmarks.
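The abstract does not specify how the gradient alignment is computed. As a rough illustration only, the sketch below shows one common way to symmetrically adjust two conflicting gradients: a PCGrad-style projection, swapped in here as a stand-in and not the authors' actual strategy. The function name `align_gradients`, the `eps` parameter, and the toy vectors are all hypothetical.

```python
import numpy as np

def align_gradients(g_perc, g_gen, eps=1e-12):
    """Symmetrically remove the conflicting component of two gradients.

    PCGrad-style stand-in (an assumption, not the GenRep method):
    if the gradients point in opposing directions (negative dot product),
    each is projected onto the plane orthogonal to the ORIGINAL other one.
    """
    dot = float(np.dot(g_perc, g_gen))
    if dot >= 0.0:
        # No conflict: leave both optimization directions unchanged.
        return g_perc.copy(), g_gen.copy()
    new_perc = g_perc - dot / (np.dot(g_gen, g_gen) + eps) * g_gen
    new_gen = g_gen - dot / (np.dot(g_perc, g_perc) + eps) * g_perc
    return new_perc, new_gen

# Toy flattened gradients of a perception loss and a generation loss.
g_perc = np.array([1.0, 0.0])
g_gen = np.array([-1.0, 1.0])
a_perc, a_gen = align_gradients(g_perc, g_gen)
```

After the projection, neither adjusted gradient opposes the other task's original gradient, which is the sense in which the modification is symmetric in this sketch.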
Cite
Text
Li et al. "Reconciling Visual Perception and Generation in Diffusion Models." International Conference on Learning Representations, 2026.
Markdown
[Li et al. "Reconciling Visual Perception and Generation in Diffusion Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-reconciling/)
BibTeX
@inproceedings{li2026iclr-reconciling,
title = {{Reconciling Visual Perception and Generation in Diffusion Models}},
author = {Li, Liulei and Yang, Yi and Wang, Wenguan},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/li2026iclr-reconciling/}
}