Modality-Aware Adaptation of Contrastive Language-Image Models

Abstract

Despite their high levels of robustness, Contrastive Language-Image Models (CLIP) still require some form of downstream adaptation when applied to tasks sufficiently out-of-domain with respect to their training set. Recent methods propose lightweight adapters on top of the model features, primarily focused on the few-shot setting. All such approaches, however, require per-task hyperparameter tuning, which necessitates access to a validation set and limits their applicability in practice. As an alternative, we propose Modality-Aware Tangent-space Retrieval (MATeR), a training-free, interpretable adapter which outperforms all recent methods when per-task hyperparameter tuning is prohibited. MATeR considers the manifold formed by CLIP embeddings when incorporating out-of-domain few-shot class information, and its predictions are invariant to the modality gap; it is the first approach that uses the geometric structure of the CLIP latent space to inform downstream task adaptation. Additionally, we demonstrate that a variant of MATeR can significantly increase zero-shot accuracy with only a handful of unlabelled images, far fewer than the number of classes.
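The modality gap mentioned in the abstract is a well-documented property of CLIP-style models: image and text embeddings occupy distinct regions of the shared unit hypersphere. The sketch below is not the MATeR method; it only illustrates, using synthetic stand-in embeddings (an assumption for self-containment), how the gap is commonly summarised as the offset between the per-modality centroids of L2-normalised features.

```python
import numpy as np

# Illustrative only: synthetic stand-ins for CLIP image/text embeddings
# (a real pipeline would encode images and text with a pretrained CLIP model).
rng = np.random.default_rng(0)
d = 512                                        # embedding dimension (hypothetical)
image_emb = rng.normal(size=(100, d)) + 0.5    # constant offset mimics the modality gap
text_emb = rng.normal(size=(100, d)) - 0.5

# CLIP embeddings are compared on the unit hypersphere, so L2-normalise first.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Summarise the modality gap as the vector between per-modality centroids.
gap_vector = image_emb.mean(axis=0) - text_emb.mean(axis=0)
print(f"modality gap magnitude: {np.linalg.norm(gap_vector):.3f}")
```

A prediction rule is invariant to this gap if translating one modality's embeddings by such an offset leaves its outputs unchanged, which is the property the abstract claims for MATeR.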

Cite

Text

Long et al. "Modality-Aware Adaptation of Contrastive Language-Image Models." ICLR 2023 Workshops: ME-FoMo, 2023.

Markdown

[Long et al. "Modality-Aware Adaptation of Contrastive Language-Image Models." ICLR 2023 Workshops: ME-FoMo, 2023.](https://mlanthology.org/iclrw/2023/long2023iclrw-modalityaware/)

BibTeX

@inproceedings{long2023iclrw-modalityaware,
  title     = {{Modality-Aware Adaptation of Contrastive Language-Image Models}},
  author    = {Long, Alexander and Ajanthan, Thalaiyasingam and van den Hengel, Anton},
  booktitle = {ICLR 2023 Workshops: ME-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/long2023iclrw-modalityaware/}
}