UniF$^2$ace: A $\underline{Uni}$fied $\underline{F}$ine-Grained $\underline{Face}$ Understanding and Generation Model

Li, Junzhe; Zhou, Sifan; Guo, Liya; Qiu, Xuerui; Xu, Linrui; Long, TingTing; Fan, Chun; Li, Ming; Fan, Hehe; Liu, Jun; Yan, Shuicheng

ICLR 2026

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain faces two primary challenges: **(1) fragmented development**, with existing methods failing to unify understanding and generation in a single model, which hinders progress toward artificial general intelligence; and **(2) a lack of fine-grained facial attributes**, which are crucial for high-fidelity applications. To address these issues, we propose UniF$^2$ace, the first UMM specifically tailored for fine-grained face understanding and generation. **First**, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, which unifies masked generative models with discrete score matching diffusion and leads to a more precise approximation of the negative log-likelihood. This D3Diff loss significantly enhances the model's ability to synthesize high-fidelity facial details aligned with the text input. **Second**, we propose a multi-level grouped Mixture-of-Experts architecture that adaptively incorporates semantic and identity facial embeddings to compensate for the attribute forgetting that occurs as representations evolve. **Finally**, we construct UniF$^2$aceD-1M, a large-scale dataset comprising *130K* fine-grained image-caption pairs and *1M* visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models of similar scale on both understanding and generation tasks, with 7.1% higher Desc-GPT and 6.6% higher VQA-score, respectively. Code is available in the supplementary materials.

Cite

Text

Li et al. "UniF$^2$ace: A $\underline{Uni}$fied $\underline{F}$ine-Grained $\underline{Face}$ Understanding and Generation Model." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "UniF$^2$ace: A $\underline{Uni}$fied $\underline{F}$ine-Grained $\underline{Face}$ Understanding and Generation Model." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-unif/)

BibTeX

@inproceedings{li2026iclr-unif,
  title     = {{UniF$^2$ace: A $\underline{Uni}$fied $\underline{F}$ine-Grained $\underline{Face}$ Understanding and Generation Model}},
  author    = {Li, Junzhe and Zhou, Sifan and Guo, Liya and Qiu, Xuerui and Xu, Linrui and Long, TingTing and Fan, Chun and Li, Ming and Fan, Hehe and Liu, Jun and Yan, Shuicheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-unif/}
}