Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Gui, Ming; Schusterbauer, Johannes; Phan, Timy; Krause, Felix; Susskind, Joshua M.; Bautista, Miguel Ángel; Ommer, Björn

ICLR 2026

Abstract

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained end-to-end using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring it remains smooth and suitable for generation. Our single-token formulation resolves the spatial redundancies of the 2D latent space, simplifies architectures, and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling. We will release our model to facilitate further research.
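The training signal described in the abstract combines two terms: a flow matching loss on the decoder and a cosine-similarity penalty that keeps the adapted token close in direction to the original SSL token. The sketch below illustrates how such a combined objective could be computed on plain Python lists. It is not the authors' implementation: the linear interpolation path, the function names, the `velocity_fn` callable, and the `reg_weight` parameter are all assumptions made for illustration, since the abstract only states that a "standard flow matching objective" and a "cosine-similarity loss" are used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_regularizer(adapted_token, frozen_token):
    """Penalty that is 0 when the adapted token has the same direction
    as the token from the frozen SSL encoder (hypothetical formulation)."""
    return 1.0 - cosine_similarity(adapted_token, frozen_token)


def flow_matching_loss(velocity_fn, image, noise, t, token):
    """Mean squared error between predicted and target velocity.

    Assumed convention (a common linear path, not confirmed by the source):
        x_t    = (1 - t) * noise + t * image
        target = image - noise
    velocity_fn is a hypothetical stand-in for the generative decoder,
    called as velocity_fn(x_t, t, token).
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, image)]
    target = [x - n for n, x in zip(noise, image)]
    pred = velocity_fn(x_t, t, token)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def total_loss(velocity_fn, image, noise, t,
               adapted_token, frozen_token, reg_weight=1.0):
    """Flow matching loss plus weighted cosine regularizer.
    reg_weight is an illustrative hyperparameter, not a value from the paper."""
    fm = flow_matching_loss(velocity_fn, image, noise, t, adapted_token)
    reg = cosine_regularizer(adapted_token, frozen_token)
    return fm + reg_weight * reg


if __name__ == "__main__":
    image = [1.0, 2.0]            # toy "image" with two pixels
    noise = [0.0, -1.0]           # toy noise sample
    frozen = [1.0, 0.0]           # token from the frozen SSL encoder
    adapted = [0.9, 0.1]          # token after fine-tuning

    def zero_decoder(x_t, t, token):
        # Untrained stand-in that always predicts zero velocity.
        return [0.0 for _ in x_t]

    print(total_loss(zero_decoder, image, noise, 0.3, adapted, frozen))
```

In a real setup the decoder and the token embedding would be trained jointly by backpropagating this scalar; the regularizer only constrains the token's direction, so the adapted token remains free to change in ways the cosine term does not penalize.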

Cite

Text

Gui et al. "Adapting Self-Supervised Representations as a Latent Space for Efficient Generation." International Conference on Learning Representations, 2026.

Markdown

[Gui et al. "Adapting Self-Supervised Representations as a Latent Space for Efficient Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/gui2026iclr-adapting/)

BibTeX

@inproceedings{gui2026iclr-adapting,
  title     = {{Adapting Self-Supervised Representations as a Latent Space for Efficient Generation}},
  author    = {Gui, Ming and Schusterbauer, Johannes and Phan, Timy and Krause, Felix and Susskind, Joshua M. and Bautista, Miguel Ángel and Ommer, Björn},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/gui2026iclr-adapting/}
}