Learning to Decompose Visual Features with Latent Textual Prompts
Abstract
Recent advances in pre-training vision-language models such as CLIP have shown great potential for learning transferable visual representations. Nonetheless, during downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness when the text descriptions used for retrieval-based inference are inaccurate (the challenge for the zero-shot protocol), or 2) a break in the well-established vision-language alignment (the challenge for linear probing). To address these issues, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, enabling the model to learn decomposed visual features with the help of feature-level textual prompts. We further use an additional linear layer to perform classification, allowing the size of the language input to scale. Our empirical study shows the significance of DeFo in improving vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0% and outperforming the state-of-the-art vision-language prompt tuning method by 7.6%.
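The architecture described in the abstract can be sketched as follows. This is a minimal illustration written from the abstract, not the authors' released code: the frozen `image_encoder` and `text_encoder` stand in for CLIP's two towers, and names such as `DeFoSketch`, `num_prompts`, and `prompt_len` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeFoSketch(nn.Module):
    """Latent textual prompts plus a linear classification head over frozen encoders."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512,
                 num_prompts=64, prompt_len=16, num_classes=1000):
        super().__init__()
        # Frozen pretrained vision and language encoders (e.g. CLIP's two towers).
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        for p in list(image_encoder.parameters()) + list(text_encoder.parameters()):
            p.requires_grad_(False)
        # Learnable latent textual prompts: num_prompts sequences of prompt_len
        # embeddings that replace hand-written class descriptions.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, prompt_len, embed_dim))
        # Linear layer mapping image-prompt similarities to class logits, so the
        # number of prompts can be chosen independently of the number of classes.
        self.classifier = nn.Linear(num_prompts, num_classes)

    def forward(self, images):
        img = self.image_encoder(images)         # (batch, embed_dim)
        txt = self.text_encoder(self.prompts)    # (num_prompts, embed_dim)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        sims = img @ txt.t()                     # (batch, num_prompts) decomposed features
        return self.classifier(sims)             # (batch, num_classes) logits
```

Under this sketch, only `self.prompts` and `self.classifier` receive gradients (e.g. from a standard cross-entropy loss), consistent with the abstract's claim that no pretrained weights of either encoder are tuned.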
Cite
Text
Wang et al. "Learning to Decompose Visual Features with Latent Textual Prompts." International Conference on Learning Representations, 2023.
Markdown
[Wang et al. "Learning to Decompose Visual Features with Latent Textual Prompts." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/wang2023iclr-learning-a/)
BibTeX
@inproceedings{wang2023iclr-learning-a,
title = {{Learning to Decompose Visual Features with Latent Textual Prompts}},
author = {Wang, Feng and Li, Manling and Lin, Xudong and Lv, Hairong and Schwing, Alex and Ji, Heng},
booktitle = {International Conference on Learning Representations},
year = {2023},
url = {https://mlanthology.org/iclr/2023/wang2023iclr-learning-a/}
}