VESSA: Video-Based objEct-Centric Self-Supervised Adaptation for Visual Foundation Models

Barreto, Jesimon; Caetano, Carlos; Araujo, Andre; Schwartz, William Robson

NeurIPS 2025

Abstract

Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: **V**ideo-based obj**E**ct-centric **S**elf-**S**upervised **A**daptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques; otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions without the need for annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
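
To make the self-distillation idea in the abstract more concrete, the following is a minimal sketch of a generic teacher-student objective applied to two frames of the same object-centric video. It is not the authors' implementation: the function names, the temperatures, the momentum value, and the frame-gap sampling rule are all assumptions chosen for illustration, and the parameter-efficient adaptation and prediction-head tuning mentioned in the abstract are not modeled here.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a sharpened teacher distribution and the student.

    The temperature values are assumed defaults, not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    p_student = softmax(student_logits, student_temp)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(p_teacher, p_student))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move teacher weights toward the student by exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def sample_frame_pair(num_frames, max_gap, rng):
    """Pick two distinct frame indices from one video, at most max_gap apart.

    Hypothetical helper: assumes num_frames >= 2 and max_gap >= 1.
    """
    first = rng.randrange(num_frames)
    low = max(0, first - max_gap)
    high = min(num_frames - 1, first + max_gap)
    second = rng.choice([i for i in range(low, high + 1) if i != first])
    return first, second


if __name__ == "__main__":
    rng = random.Random(0)
    view_a, view_b = sample_frame_pair(num_frames=30, max_gap=5, rng=rng)
    # Stand-in logits; a real setup would run the student on one frame
    # and the teacher on the other.
    loss = distillation_loss([2.0, 0.5, 0.1], [1.8, 0.4, 0.2])
    print(view_a, view_b, loss)
```

In this sketch the two frames play the role that augmented crops play in image-only self-distillation: the student is trained to match the teacher's output on a different view of the same object, and the teacher is updated only through the moving average rather than by gradients.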

Cite

Text

Barreto et al. "VESSA: Video-Based objEct-Centric Self-Supervised Adaptation for Visual Foundation Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Barreto et al. "VESSA: Video-Based objEct-Centric Self-Supervised Adaptation for Visual Foundation Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/barreto2025neurips-vessa/)

BibTeX

@inproceedings{barreto2025neurips-vessa,
  title     = {{VESSA: Video-Based objEct-Centric Self-Supervised Adaptation for Visual Foundation Models}},
  author    = {Barreto, Jesimon and Caetano, Carlos and Araujo, Andre and Schwartz, William Robson},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/barreto2025neurips-vessa/}
}