Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Li, Yayuan; Guo, Jintao; Qi, Lei; Li, Wenbin; Shi, Yinghuan

doi:10.1609/AAAI.V39I5.32534

AAAI 2025 pp. 5039-5047

Abstract

Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks and has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods for training-free FSL (i.e., methods that require no additional training) mainly learn the different modalities independently, which leads to two essential issues: 1) severe anomalous matches in the image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component, which rectifies the varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component, which mitigates anomalous matches in the image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization and propose TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% at approximately 100× lower time cost.
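
For readers unfamiliar with the training-free setting, the following is a minimal sketch of how text and image evidence can be combined without any gradient updates. It is not the authors' IGT or TGI component, whose details are not given in this abstract; it only illustrates the generic baseline of adding a text-classifier score to a few-shot image-similarity score, in the style of cache-based approaches such as Tip-Adapter. The function names, the weights `alpha` and `beta`, and the synthetic features standing in for CLIP embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def training_free_logits(query, text_protos, support_feats, support_labels,
                         num_classes, alpha=1.0, beta=5.5):
    """Combine text and few-shot image evidence with no training.

    Hypothetical helper, not the TIMO method. `alpha` weights the image
    term and `beta` sharpens the similarity kernel; both are assumed values.
    """
    query = l2_normalize(query)
    text_protos = l2_normalize(text_protos)
    support_feats = l2_normalize(support_feats)

    # Text branch: cosine similarity to one text embedding per class.
    text_logits = query @ text_protos.T                 # (Q, C)

    # Image branch: similarity to each labelled support image,
    # passed through an exponential kernel and summed per class.
    sim = query @ support_feats.T                       # (Q, N)
    affinity = np.exp(-beta * (1.0 - sim))              # (Q, N)
    one_hot = np.eye(num_classes)[support_labels]       # (N, C)
    image_logits = affinity @ one_hot                   # (Q, C)

    return text_logits + alpha * image_logits


# Synthetic stand-ins for CLIP features: 3 classes, 8 dimensions, 2 shots.
rng = np.random.default_rng(0)
centers = np.eye(3, 8)
text_protos = centers.copy()
support_labels = np.array([0, 0, 1, 1, 2, 2])
support_feats = centers[support_labels] + 0.01 * rng.standard_normal((6, 8))
query = centers + 0.01 * rng.standard_normal((3, 8))

logits = training_free_logits(query, text_protos, support_feats,
                              support_labels, num_classes=3)
preds = logits.argmax(axis=1)
print(preds)
```

In this baseline the two branches are computed independently and only summed at the end, which is the kind of independent treatment of modalities the abstract identifies as a limitation.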

Cite

Text

Li et al. "Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I5.32534

Markdown

[Li et al. "Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/li2025aaai-text/) doi:10.1609/AAAI.V39I5.32534

BibTeX

@inproceedings{li2025aaai-text,
  title     = {{Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP}},
  author    = {Li, Yayuan and Guo, Jintao and Qi, Lei and Li, Wenbin and Shi, Yinghuan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {5039-5047},
  doi       = {10.1609/AAAI.V39I5.32534},
  url       = {https://mlanthology.org/aaai/2025/li2025aaai-text/}
}