Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Abstract
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained end-to-end using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring it remains smooth and suitable for generation. Our single-token formulation resolves the spatial redundancies of the 2D latent space, simplifies architectures, and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling. We will release our model to facilitate further research.
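The training signal described above combines a flow matching objective for the decoder with a cosine-similarity regularizer on the adapted token. The following is a minimal sketch of how such a combined loss could be assembled, not the authors' implementation. The linear interpolation path between noise and image, the `reg_weight` coefficient, and all function names (`predict_velocity`, `cosine_regularizer`, `flow_matching_loss`, `total_loss`) are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regularizer(adapted_token, frozen_token):
    """Penalty that is 0 when the adapted token points in the same
    direction as the original SSL token and grows as they diverge."""
    return 1.0 - cosine_similarity(adapted_token, frozen_token)


def flow_matching_loss(predict_velocity, image, token, rng):
    """Single-sample flow matching loss on a flattened image.

    Assumes a linear path x_t = (1 - t) * noise + t * image, whose
    target velocity is image - noise. `predict_velocity` is a
    hypothetical stand-in for the token-conditioned generative decoder.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    t = rng.random()  # uniform in [0, 1)
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, image)]
    target = [x - n for n, x in zip(noise, image)]
    pred = predict_velocity(x_t, t, token)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(image)


def total_loss(predict_velocity, image, adapted_token, frozen_token,
               rng, reg_weight=1.0):
    """Flow matching term plus weighted cosine regularizer.

    `reg_weight` is a hypothetical hyperparameter, not a value from
    the paper.
    """
    fm = flow_matching_loss(predict_velocity, image, adapted_token, rng)
    reg = cosine_regularizer(adapted_token, frozen_token)
    return fm + reg_weight * reg
```

In this sketch the regularizer depends only on the direction of the adapted token relative to the frozen SSL token, so the token's norm is left free to carry reconstruction-relevant detail. Whether the paper constrains the norm as well is not stated in the abstract.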
Cite
Text
Gui et al. "Adapting Self-Supervised Representations as a Latent Space for Efficient Generation." International Conference on Learning Representations, 2026.
Markdown
[Gui et al. "Adapting Self-Supervised Representations as a Latent Space for Efficient Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/gui2026iclr-adapting/)
BibTeX
@inproceedings{gui2026iclr-adapting,
title = {{Adapting Self-Supervised Representations as a Latent Space for Efficient Generation}},
author = {Gui, Ming and Schusterbauer, Johannes and Phan, Timy and Krause, Felix and Susskind, Joshua M. and Bautista, Miguel Ángel and Ommer, Björn},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/gui2026iclr-adapting/}
}