UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Yue, Zhengrong; Zhang, Haiyu; Zeng, Xiangyu; Chen, Boyu; Wang, Chenting; Zhuang, Shaobin; Dong, Lu; Wang, Yi; Wang, Limin; Wang, Yali

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, Yi Wang, Limin Wang, Yali Wang

ICLR 2026

/iclr/2026/yue2026iclr-uniflow/

Abstract

Tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely $\textbf{UniFlow}$, by flexibly adapting any visual encoder with a concise reconstruction decoder. Specifically, we introduce $\textit{layer-wise adaptive self-distillation}$ applied to the well-pretrained visual encoders, which enables UniFlow to simultaneously inherit the strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight $\textit{patch-wise pixel flow decoder}$, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 6.05\% on average understanding benchmarks, but also achieves a competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.

PDF ICLR OpenReview Semantic Scholar

Cite

Text

Yue et al. "UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation." International Conference on Learning Representations, 2026.

Markdown

[Yue et al. "UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yue2026iclr-uniflow/)

BibTeX

@inproceedings{yue2026iclr-uniflow,
  title     = {{UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation}},
  author    = {Yue, Zhengrong and Zhang, Haiyu and Zeng, Xiangyu and Chen, Boyu and Wang, Chenting and Zhuang, Shaobin and Dong, Lu and Wang, Yi and Wang, Limin and Wang, Yali},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yue2026iclr-uniflow/}
}