Semantic Alignment for Prompt-Tuning in Vision Language Models

Abstract

Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite its potential, effectively learning prompts faces the following challenges: (i) training in a low-shot scenario results in overfitting, limiting adaptability and yielding weaker performance on new classes or datasets; (ii) the efficacy of prompt tuning depends heavily on the label space, with performance degrading in large class spaces, signaling potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs) to bridge the image and text modalities. Our approach constructs part-level, description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Comprehensive experiments across 11 benchmark datasets show that our method substantially outperforms established methods.
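
To make the abstract's core idea concrete, here is a minimal, hedged sketch of what "description-guided alignment" could look like in practice. It is not the authors' implementation: the function name, the attention-pooling scheme, the contrastive loss, and all shapes are illustrative assumptions, standing in for frozen VLM encoders and LLM-generated part-level class descriptions.

```python
# Illustrative sketch only (not the paper's code): align description-guided
# image features with LLM class-description embeddings via a contrastive loss.
import torch
import torch.nn.functional as F

def description_guided_alignment(patch_feats, desc_feats, temperature=0.07):
    """Pool image patches against LLM-generated part-level descriptions,
    then align each pooled ("part-level") image feature with its description.

    patch_feats: (num_patches, d) patch embeddings from a frozen vision encoder
    desc_feats:  (num_descriptions, d) embeddings of part-level class
                 descriptions from a frozen text encoder
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    desc_feats = F.normalize(desc_feats, dim=-1)

    # Each description attends over the image patches it matches best,
    # producing description-guided image features.
    attn = torch.softmax(patch_feats @ desc_feats.T / temperature, dim=0)  # (P, D)
    guided_img_feats = F.normalize(attn.T @ patch_feats, dim=-1)           # (D, d)

    # Alignment loss: the i-th guided image feature should match the i-th description.
    logits = guided_img_feats @ desc_feats.T / temperature                 # (D, D)
    targets = torch.arange(desc_feats.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for frozen encoder outputs.
patches = torch.randn(49, 512, requires_grad=True)
descriptions = torch.randn(5, 512, requires_grad=True)
loss = description_guided_alignment(patches, descriptions)
loss.backward()  # in a real setup, gradients would flow into the learnable prompts
```

In an actual prompt-tuning pipeline, the gradient of such an alignment loss would update only the learnable prompt vectors while the VLM encoders stay frozen; the sketch above simply marks the input tensors as trainable to keep the example self-contained.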

Cite

Text

Kuchibhotla et al. "Semantic Alignment for Prompt-Tuning in Vision Language Models." Transactions on Machine Learning Research, 2025.

Markdown

[Kuchibhotla et al. "Semantic Alignment for Prompt-Tuning in Vision Language Models." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/kuchibhotla2025tmlr-semantic/)

BibTeX

@article{kuchibhotla2025tmlr-semantic,
  title     = {{Semantic Alignment for Prompt-Tuning in Vision Language Models}},
  author    = {Kuchibhotla, Hari Chandana and Kancheti, Sai Srinivas and Reddy, Abbavaram Gowtham and Balasubramanian, Vineeth N.},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/kuchibhotla2025tmlr-semantic/}
}