RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models
Abstract
Vision-language models (VLMs) are known to be susceptible to distribution shifts between pre-training data and test data, and test-time adaptation (TTA) methods for VLMs have been proposed to mitigate the detrimental impact of such shifts. However, existing methods rely solely on the internal knowledge encoded in the model parameters, which is constrained to the pre-training data. To complement this limited internal knowledge, we propose **Retrieval-Augmented-TTA (RA-TTA)**, which adapts VLMs to the test distribution using **external** knowledge obtained from a web-scale image database. By fully exploiting the bi-modality of VLMs, RA-TTA **adaptively** retrieves suitable external images for each test image and uses them to refine the VLM's predictions, with fine-grained **text descriptions** leveraged to extend the granularity of the external knowledge. Extensive experiments on 17 datasets demonstrate that RA-TTA outperforms state-of-the-art methods by 3.01–9.63% on average.
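To make the abstract's pipeline concrete, below is a minimal NumPy sketch of the general retrieval-augmented TTA idea: compute the VLM's zero-shot prediction, retrieve the most similar external images, and blend in their description-based class scores. This is not the authors' implementation; the function `ra_tta_predict`, the precomputed `db_desc_logits`, the mixing weight `alpha`, and the temperature are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1, temp=100.0):
    z = temp * x
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def ra_tta_predict(test_img_emb, db_img_embs, db_desc_logits, class_text_embs,
                   k=16, alpha=0.5):
    """Refine a zero-shot VLM prediction with retrieved external images (sketch).

    test_img_emb    : (d,)   test-image embedding from a CLIP-style encoder
    db_img_embs     : (N, d) embeddings of the external image database
    db_desc_logits  : (N, C) per-class scores of each database image against
                      fine-grained class descriptions (assumed precomputed)
    class_text_embs : (C, d) embeddings of the class prompts
    """
    test_img_emb = l2_normalize(test_img_emb)
    db_img_embs = l2_normalize(db_img_embs)
    class_text_embs = l2_normalize(class_text_embs)

    # 1) Zero-shot prediction from the model's internal knowledge.
    zero_shot = softmax(test_img_emb @ class_text_embs.T)          # (C,)

    # 2) Retrieve the k database images most similar to the test image.
    sims = db_img_embs @ test_img_emb                              # (N,)
    topk = np.argsort(-sims)[:k]

    # 3) Aggregate the retrieved images' description-based class scores,
    #    weighted by their similarity to the test image.
    weights = softmax(sims[topk])                                  # (k,)
    external = weights @ softmax(db_desc_logits[topk], axis=-1)    # (C,)

    # 4) Blend internal and external evidence (alpha is an assumed knob).
    return alpha * zero_shot + (1.0 - alpha) * external
```

In practice, the exhaustive similarity search in step 2 would be replaced by a query to a web-scale approximate nearest-neighbor index over the external image database.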
Cite
Text
Lee et al. "RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models." International Conference on Learning Representations, 2025.

Markdown

[Lee et al. "RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lee2025iclr-ratta/)

BibTeX
@inproceedings{lee2025iclr-ratta,
  title = {{RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models}},
  author = {Lee, Youngjun and Kim, Doyoung and Kang, Junhyeok and Bang, Jihwan and Song, Hwanjun and Lee, Jae-Gil},
  booktitle = {International Conference on Learning Representations},
  year = {2025},
  url = {https://mlanthology.org/iclr/2025/lee2025iclr-ratta/}
}