One Last Attention for Your Vision-Language Model

Abstract

Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods focus on refining representations from the separate modalities (text or vision) but neglect the critical role of their fused representation in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge this gap, we propose a simple yet effective Rational Adaptation (RAda) that explicitly exploits the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interaction without costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing the pretrained encoders during adaptation, and test-time training with access only to unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably with current state-of-the-art methods in most settings.
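
The sketch below illustrates the idea in PyTorch under our own assumptions; the class name RationalAdaptation, the element-wise product used as the rational matrix, and the sigmoid-masked single attention layer are illustrative choices, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RationalAdaptation(nn.Module):
    """Minimal sketch of a RAda-style head (shapes and names are assumptions).

    Given image features f_img (B, D) and class text features f_txt (C, D)
    from a CLIP-like backbone, the per-class "rational matrix" is taken here
    as the element-wise product f_img[b] * f_txt[c] with shape (B, C, D),
    whose sum over D recovers the usual cosine-similarity logit. A single
    lightweight attention layer appended after the encoders produces a mask
    that reweights each element before the sum.
    """

    def __init__(self, dim: int, num_heads: int = 1):
        super().__init__()
        # The "one last attention" layer attached at the end of the VLM.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.logit_scale = nn.Parameter(torch.tensor(100.0).log())

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        f_img = F.normalize(f_img, dim=-1)                 # (B, D)
        f_txt = F.normalize(f_txt, dim=-1)                 # (C, D)
        rational = f_img[:, None, :] * f_txt[None, :, :]   # (B, C, D) fused representation

        # Attention over the class axis yields a data-dependent calibration mask.
        attn_out, _ = self.attn(rational, rational, rational)  # (B, C, D)
        mask = torch.sigmoid(attn_out)                          # weights in (0, 1)

        # Masked sum over the feature dimension gives calibrated logits.
        logits = (mask * rational).sum(dim=-1)             # (B, C)
        return self.logit_scale.exp() * logits

In use, the pretrained encoders can be frozen and only this head trained, or both can be updated jointly; either way the intermediate features are left untouched and only the final cross-modal interaction is adjusted.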

Cite

Text

Chen et al. "One Last Attention for Your Vision-Language Model." International Conference on Computer Vision, 2025.

Markdown

[Chen et al. "One Last Attention for Your Vision-Language Model." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/chen2025iccv-one/)

BibTeX

@inproceedings{chen2025iccv-one,
  title     = {{One Last Attention for Your Vision-Language Model}},
  author    = {Chen, Liang and Ahmad, Ghazi Shazan and Yao, Tianjun and Liu, Lingqiao and Shen, Zhiqiang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {1464--1473},
  url       = {https://mlanthology.org/iccv/2025/chen2025iccv-one/}
}