Show-O2: Improved Native Unified Multimodal Models

Abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe enables effective learning and scaling to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
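To make the dual-head idea concrete, below is a minimal PyTorch-style sketch of a shared backbone with a language head trained by autoregressive cross-entropy and a flow head trained by flow matching on visual latents. This is not the Show-o2 implementation: the toy transformer backbone, module names (DualHeadToy, latent_in), dimensions, the plain linear flow head, and the linear-interpolation flow-matching target are assumptions for illustration, and components such as the 3D causal VAE, spatial(-temporal) fusion, and attention masking are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only; the real model configs differ.
VOCAB, HIDDEN, LATENT = 32000, 1024, 16

class DualHeadToy(nn.Module):
    """Toy unified backbone with a language head (next-token prediction)
    and a flow head (flow-matching velocity prediction)."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
        self.latent_in = nn.Linear(LATENT, HIDDEN)   # embed noised visual latents
        self.lm_head = nn.Linear(HIDDEN, VOCAB)      # text logits
        self.flow_head = nn.Linear(HIDDEN, LATENT)   # predicted velocity

    def forward(self, text_embeds, noised_latents):
        # Concatenate text and visual tokens into one sequence
        # (causal/omni attention masks omitted for brevity).
        seq = torch.cat([text_embeds, self.latent_in(noised_latents)], dim=1)
        hidden = self.backbone(seq)
        n_text = text_embeds.size(1)
        return self.lm_head(hidden[:, :n_text]), self.flow_head(hidden[:, n_text:])

def losses(model, text_embeds, text_targets, clean_latents):
    """Autoregressive loss on text tokens; flow-matching loss on visual latents,
    using the linear path x_t = (1 - t) * noise + t * x_1 with target velocity x_1 - noise."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.size(0), 1, 1)
    x_t = (1 - t) * noise + t * clean_latents
    logits, v_pred = model(text_embeds, x_t)
    ar_loss = F.cross_entropy(logits.reshape(-1, VOCAB), text_targets.reshape(-1))
    fm_loss = F.mse_loss(v_pred, clean_latents - noise)
    return ar_loss, fm_loss

# Example with random data (batch of 2, 8 text tokens, 4 visual latent tokens):
model = DualHeadToy()
text = torch.randn(2, 8, HIDDEN)
targets = torch.randint(0, VOCAB, (2, 8))
latents = torch.randn(2, 4, LATENT)
ar_loss, fm_loss = losses(model, text, targets, latents)

In practice the two losses are combined (typically as a weighted sum) so that text prediction and image/video generation are learned jointly within one model.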

Cite

Text

Xie et al. "Show-O2: Improved Native Unified Multimodal Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Xie et al. "Show-O2: Improved Native Unified Multimodal Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/xie2025neurips-showo2/)

BibTeX

@inproceedings{xie2025neurips-showo2,
  title     = {{Show-O2: Improved Native Unified Multimodal Models}},
  author    = {Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/xie2025neurips-showo2/}
}