Reconstruction Alignment Improves Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) combine visual understanding and generation within a single architecture. However, conventional training relies on image–text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce **Reconstruction Alignment (RecA)**, a resource-efficient post-training method that uses embeddings from the visual understanding encoder as dense “text prompts,” providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), and also raises scores on editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.27). Notably, RecA surpasses much larger open-source models, establishing it as an efficient and general post-training alignment strategy for UMMs.
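The reconstruction objective described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the frozen linear map `W_enc` stands in for the visual understanding encoder, the trainable linear map `W_gen` stands in for the UMM's generation pathway, and mean-squared error stands in for the reconstruction loss, whose actual form the abstract does not specify. All names, sizes, and the learning rate are hypothetical. The sketch only shows the shape of the training signal: the model is conditioned on embeddings of the input image and optimized to reproduce that same image, with no captions involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): flattened "image" dimension, embedding dimension, batch size.
D_IMG, D_EMB, N = 16, 32, 64

# Assumption: a frozen random linear map stands in for the visual understanding encoder.
W_enc = rng.normal(size=(D_IMG, D_EMB)) / np.sqrt(D_IMG)

def understanding_embed(images):
    """Map images to dense embeddings that act as the conditioning 'prompt'."""
    return images @ W_enc

def reconstruction_loss(images, W):
    """Mean-squared error between the input images and their reconstructions."""
    recon = understanding_embed(images) @ W
    return float(np.mean((recon - images) ** 2))

# Assumption: a trainable linear map stands in for the generation pathway.
W_gen = np.zeros((D_EMB, D_IMG))

# Uncaptioned "images": the only supervision is the image itself.
images = rng.normal(size=(N, D_IMG))
loss_before = reconstruction_loss(images, W_gen)

lr = 0.5  # hypothetical step size, small enough for this toy quadratic problem
for _ in range(500):
    emb = understanding_embed(images)               # condition on own embeddings
    recon = emb @ W_gen                             # "generate" from the embeddings
    grad = 2.0 * emb.T @ (recon - images) / images.size  # gradient of the MSE w.r.t. W_gen
    W_gen -= lr * grad                              # only the generation side is updated

loss_after = reconstruction_loss(images, W_gen)
print(f"reconstruction loss: {loss_before:.4f} -> {loss_after:.4f}")
```

In this toy setting the encoder stays fixed and only the generation-side weights are updated; whether and which components are frozen in the actual method is not stated in the abstract.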

Cite

Text

Xie et al. "Reconstruction Alignment Improves Unified Multimodal Models." International Conference on Learning Representations, 2026.

Markdown

[Xie et al. "Reconstruction Alignment Improves Unified Multimodal Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xie2026iclr-reconstruction/)

BibTeX

@inproceedings{xie2026iclr-reconstruction,
  title     = {{Reconstruction Alignment Improves Unified Multimodal Models}},
  author    = {Xie, Ji and Darrell, Trevor and Zettlemoyer, Luke and Wang, XuDong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xie2026iclr-reconstruction/}
}