RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao

ICCV 2025 pp. 14994-15004

Abstract

Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and an attention mask to mitigate cross-modal interference. RealGeneral is effective across multiple important visual generation tasks; for example, it achieves a 14.5% improvement in subject similarity for customized generation and a 10% improvement in image quality for the Canny-to-image task. Project Page: https://lyne1.github.io/realgeneral_web/; GitHub Link: https://github.com/Lyne1/RealGeneral

Cite

Text

Lin et al. "RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models." International Conference on Computer Vision, 2025.

Markdown

[Lin et al. "RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/lin2025iccv-realgeneral/)

BibTeX

@inproceedings{lin2025iccv-realgeneral,
  title     = {{RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models}},
  author    = {Lin, Yijing and Huang, Mengqi and Zhuang, Shuhan and Mao, Zhendong},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {14994--15004},
  url       = {https://mlanthology.org/iccv/2025/lin2025iccv-realgeneral/}
}