A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models

Abstract

In deep learning, maintaining model robustness against distribution shifts is critical. This work explores a broad range of possibilities for adapting vision-language foundation models at test time, with a particular emphasis on CLIP [37] and its variants. The study systematically examines prompt-based techniques and existing test-time adaptation methods, aiming to improve robustness under distribution shift in diverse real-world scenarios. Specifically, the investigation covers various prompt engineering strategies, including handcrafted prompts, prompt ensembles, and prompt learning techniques. Additionally, we introduce a vision-text-space ensemble that substantially enhances average performance compared to text-space-only ensembles. Since online test-time adaptation has been shown to be effective at mitigating performance drops under distribution shift, the study extends its scope to evaluate the effectiveness of existing test-time adaptation methods that were originally designed for vision-only classification models. Through extensive experimental evaluations conducted across multiple datasets and diverse model architectures, the research demonstrates the effectiveness of these adaptation strategies. Code is available at: https://github.com/mariodoebler/test-time-adaptation.
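
To make the idea of a text-space prompt ensemble concrete, the following is a minimal sketch of the commonly used recipe: embed several prompts per class, average the normalized embeddings, and classify an image by cosine similarity. It is not the authors' implementation and does not reproduce their vision-text-space ensemble. The function names and the small hand-written vectors are hypothetical stand-ins for the outputs of CLIP's text and image encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_class_embedding(prompt_embeddings):
    """Average the normalized prompt embeddings of one class, then renormalize."""
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def zero_shot_predict(image_embedding, class_embeddings):
    """Return the index of the most similar class and all cosine similarities."""
    img = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(img, c)) for c in class_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical toy embeddings; a real pipeline would obtain these from the
# text encoder (one vector per prompt template) and the image encoder.
cat_prompts = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]]
dog_prompts = [[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]]

class_embeddings = [
    ensemble_class_embedding(cat_prompts),
    ensemble_class_embedding(dog_prompts),
]
pred, scores = zero_shot_predict([0.8, 0.2, 0.1], class_embeddings)
```

Averaging in embedding space keeps inference cost at one similarity per class regardless of how many prompt templates are used, which is one reason this form of ensembling is popular for zero-shot classifiers.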

Cite

Text

Döbler et al. "A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91672-4_8

Markdown

[Döbler et al. "A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/dobler2024eccvw-lost/) doi:10.1007/978-3-031-91672-4_8

BibTeX

@inproceedings{dobler2024eccvw-lost,
  title     = {{A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models}},
  author    = {Döbler, Mario and Marsden, Robert A. and Raichle, Tobias and Yang, Bin},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {117-133},
  doi       = {10.1007/978-3-031-91672-4_8},
  url       = {https://mlanthology.org/eccvw/2024/dobler2024eccvw-lost/}
}