UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Abstract

The tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely $\textbf{UniFlow}$, which flexibly adapts any visual encoder with a concise reconstruction decoder. Specifically, we introduce $\textit{layer-wise adaptive self-distillation}$, applied to well-pretrained visual encoders, which enables UniFlow to simultaneously inherit strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight $\textit{patch-wise pixel flow decoder}$, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 6.05\% on average across understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.
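
To make the idea of a conditional flow "from the noisy state back to the patch-wise pixel domain" more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the flow parameterization, so the sketch assumes a straight-line (rectified-flow style) path between noise and pixels. All function names (`patchify`, `interpolate`, `target_velocity`, `euler_decode`) are hypothetical, and an oracle velocity function stands in for a trained decoder conditioned on encoder features.

```python
import random

def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch blocks."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def interpolate(noise, pixels, t):
    """Point on the assumed straight path from noise (t=0) to pixels (t=1)."""
    return [(1.0 - t) * n + t * p for n, p in zip(noise, pixels)]

def target_velocity(noise, pixels):
    """Regression target for a velocity model on the straight path."""
    return [p - n for n, p in zip(noise, pixels)]

def euler_decode(noise, velocity_fn, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy usage: a 4x4 "image" split into four 2x2 patches.
rng = random.Random(0)
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
noise = [rng.gauss(0.0, 1.0) for _ in patches[0]]

# Assumption: the oracle returns the exact target velocity. A real decoder
# would predict it from the noisy patch, the time step, and semantic features.
def oracle(x, t):
    return target_velocity(noise, patches[0])

decoded = euler_decode(noise, oracle)
```

Because each patch is decoded independently in this sketch, the model only has to fit the distribution of small pixel blocks rather than whole images, which is one way to read the abstract's claim that patch-wise learning simplifies the data distribution.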

Cite

Text

Yue et al. "UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation." International Conference on Learning Representations, 2026.

Markdown

[Yue et al. "UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yue2026iclr-uniflow/)

BibTeX

@inproceedings{yue2026iclr-uniflow,
  title     = {{UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation}},
  author    = {Yue, Zhengrong and Zhang, Haiyu and Zeng, Xiangyu and Chen, Boyu and Wang, Chenting and Zhuang, Shaobin and Dong, Lu and Wang, Yi and Wang, Limin and Wang, Yali},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yue2026iclr-uniflow/}
}