Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes
Abstract
Large-scale Vision-Language Models (VLMs) have demonstrated remarkable few-shot learning capabilities across various visual tasks. However, effectively adapting these models to remote sensing, a domain characterized by specialized object appearances and scarce labeled data, remains non-trivial. In this work, we present a training-free adaptation strategy that employs region-level visual prototypes for object detection in remote sensing imagery. Instead of relying on textual prompts, we directly derive representative embeddings from a small number of annotated bounding boxes, capturing domain-specific characteristics that generic language encoders may overlook. To compensate for the resulting modality gap between region-region and region-text similarities, we introduce an affine normalization step that re-calibrates prototype-based scores without any model fine-tuning. We evaluate our method on the DIOR and NWPU-VHR10 benchmarks, demonstrating consistent and substantial improvements over previous training-free approaches. Moreover, we offer an in-depth analysis of different prototype construction and aggregation strategies, revealing how carefully chosen protocols can further strengthen few-shot detection in remote sensing.
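The abstract's two core ideas — averaging embeddings of a few annotated boxes into per-class visual prototypes, and affinely re-scaling the resulting region-region similarities onto the range of the model's native region-text scores — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; all function names, and the choice of matching mean and standard deviation for the affine step, are assumptions.

```python
import numpy as np

def build_prototypes(box_embeddings, labels, num_classes):
    """Average the L2-normalized embeddings of each class's annotated
    boxes into one visual prototype per class (illustrative)."""
    embs = box_embeddings / np.linalg.norm(box_embeddings, axis=1, keepdims=True)
    protos = np.stack([embs[labels == c].mean(axis=0) for c in range(num_classes)])
    # Re-normalize so prototype scores are cosine similarities.
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def affine_calibrate(proto_scores, text_scores):
    """Re-calibrate prototype-based (region-region) scores with an affine
    map s' = a*s + b so their mean and std match the region-text scores
    the detector was tuned on -- one hypothetical instantiation of the
    affine normalization step described in the abstract."""
    a = text_scores.std() / (proto_scores.std() + 1e-8)
    b = text_scores.mean() - a * proto_scores.mean()
    return a * proto_scores + b
```

Since both steps are closed-form, the adaptation requires no gradient updates to the VLM, which is what makes the approach training-free.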
Cite
Text
Barbier et al. "Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Barbier et al. "Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/barbier2025cvprw-bridging/)

BibTeX
@inproceedings{barbier2025cvprw-bridging,
title = {{Bridging the Modality Gap: Training-Free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes}},
author = {Barbier, Clément and Abeloss, Baptiste and Herbin, Stéphane},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2025},
pages = {3057--3066},
url = {https://mlanthology.org/cvprw/2025/barbier2025cvprw-bridging/}
}