Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes
Abstract
Large-scale Vision-Language Models (VLMs) have demonstrated remarkable few-shot learning capabilities across various visual tasks. However, effectively adapting these models to remote sensing, a domain characterized by specialized object appearances and scarce labeled data, remains non-trivial. In this work, we present a training-free adaptation strategy that employs region-level visual prototypes for object detection in remote sensing imagery. Instead of relying on textual prompts, we directly derive representative embeddings from a small number of annotated bounding boxes, capturing domain-specific characteristics that generic language encoders may overlook. To compensate for the resulting modality gap between region-region and region-text similarities, we introduce an affine normalization step that re-calibrates prototype-based scores without any model fine-tuning. We evaluate our method on the DIOR and NWPU-VHR10 benchmarks, demonstrating consistent and substantial improvements over previous training-free approaches. Moreover, we offer an in-depth analysis of different prototype construction and aggregation strategies, revealing how carefully chosen protocols can further strengthen few-shot detection in remote sensing.
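The abstract's two core ideas — averaging embeddings of a few annotated boxes into per-class visual prototypes, and affinely re-scaling the resulting region-region similarities onto the range of the model's native region-text scores — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; all function names, and the choice of matching mean and standard deviation for the affine step, are assumptions.

```python
import numpy as np

def build_prototypes(box_embeddings, labels, num_classes):
    """Average the L2-normalized embeddings of each class's annotated
    boxes into one visual prototype per class (illustrative)."""
    embs = box_embeddings / np.linalg.norm(box_embeddings, axis=1, keepdims=True)
    protos = np.stack([embs[labels == c].mean(axis=0) for c in range(num_classes)])
    # Re-normalize so prototype scores are cosine similarities.
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def affine_calibrate(proto_scores, text_scores):
    """Re-calibrate prototype-based (region-region) scores with an affine
    map s' = a*s + b so their mean and std match the region-text scores
    the detector was tuned on -- one hypothetical instantiation of the
    affine normalization step described in the abstract."""
    a = text_scores.std() / (proto_scores.std() + 1e-8)
    b = text_scores.mean() - a * proto_scores.mean()
    return a * proto_scores + b
```

Since both steps are closed-form, the adaptation requires no gradient updates to the VLM, which is what makes the approach training-free.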
Cite
Text
Barbier et al. "Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Barbier et al. "Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/barbier2025cvprw-bridging/)

BibTeX
@inproceedings{barbier2025cvprw-bridging,
title = {{Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes}},
author = {Barbier, Clément and Abeloss, Baptiste and Herbin, Stéphane},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2025},
pages = {3057--3066},
url = {https://mlanthology.org/cvprw/2025/barbier2025cvprw-bridging/}
}