DIP: Unsupervised Dense In-Context Post-Training of Visual Representations
Abstract
We introduce DIP, a novel unsupervised post-training method designed to enhance dense representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder on pseudo-tasks that simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder. DIP is simple, unsupervised, and computationally efficient, requiring under 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a variety of downstream real-world in-context scene understanding tasks, outperforming both the initial vision encoder and prior methods and offering a practical and effective solution for improving dense representations.
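Concretely, the in-context scene understanding setting that DIP targets can be pictured as dense nearest-neighbor label propagation: a frozen encoder embeds a query image and a few annotated prompt images into patch features, and each query patch inherits the class of its most similar prompt patch. The sketch below illustrates this retrieval-style protocol only; the `PatchEncoder` stub, function names, and shapes are illustrative assumptions and not the paper's actual pipeline or code.

```python
# Minimal sketch (not the authors' code): dense in-context label propagation.
# A frozen encoder maps images to patch features; each query patch is labeled
# by its nearest prompt patch under cosine similarity. PatchEncoder is a toy
# stand-in for a pretrained ViT-style backbone, used here only for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEncoder(nn.Module):
    """Placeholder for a pretrained vision encoder returning patch features."""

    def __init__(self, dim=384, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                        # x: (B, 3, H, W)
        feats = self.proj(x)                     # (B, dim, H/p, W/p)
        return feats.flatten(2).transpose(1, 2)  # (B, N_patches, dim)


@torch.no_grad()
def in_context_segment(encoder, query, prompts, prompt_labels):
    """Label each query patch with the class of its nearest prompt patch.

    query:         (1, 3, H, W) image
    prompts:       (K, 3, H, W) support images
    prompt_labels: (K, N) integer class id per prompt patch
    returns:       (N,) predicted class id per query patch
    """
    q = F.normalize(encoder(query), dim=-1)      # (1, N, D)
    p = F.normalize(encoder(prompts), dim=-1)    # (K, N, D)
    p = p.flatten(0, 1)                          # (K*N, D)
    sim = q[0] @ p.T                             # (N, K*N) cosine similarities
    nn_idx = sim.argmax(dim=-1)                  # nearest prompt patch per query patch
    return prompt_labels.flatten(0, 1)[nn_idx]


if __name__ == "__main__":
    enc = PatchEncoder().eval()
    query = torch.randn(1, 3, 224, 224)
    prompts = torch.randn(4, 3, 224, 224)
    prompt_labels = torch.randint(0, 21, (4, 196))  # 196 = (224 / 16) ** 2 patches
    pred = in_context_segment(enc, query, prompts, prompt_labels)
    print(pred.shape)                                # torch.Size([196])
```

In this setting the quality of the frozen patch features alone determines segmentation accuracy, which is why post-training the encoder on pseudo in-context tasks directly improves downstream in-context performance.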
Cite
Text

Sirko-Galouchenko et al. "DIP: Unsupervised Dense In-Context Post-Training of Visual Representations." International Conference on Computer Vision, 2025.

Markdown

[Sirko-Galouchenko et al. "DIP: Unsupervised Dense In-Context Post-Training of Visual Representations." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/sirkogalouchenko2025iccv-dip/)

BibTeX
@inproceedings{sirkogalouchenko2025iccv-dip,
  title     = {{DIP: Unsupervised Dense In-Context Post-Training of Visual Representations}},
  author    = {Sirko-Galouchenko, Sophia and Gidaris, Spyros and Vobecky, Antonin and Bursuc, Andrei and Thome, Nicolas},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {4264--4274},
  url       = {https://mlanthology.org/iccv/2025/sirkogalouchenko2025iccv-dip/}
}