Tuning Multi-Mode Token-Level Prompt Alignment Across Modalities

Abstract

Advancements in prompt tuning of vision-language models have underscored their potential in enhancing open-world visual concept comprehension. However, prior works primarily focus on single-mode (only one prompt for each modality) and holistic-level (image or sentence) semantic alignment, which fails to capture sample diversity and leads to sub-optimal prompt discovery. To address this limitation, we propose a multi-mode token-level tuning framework that leverages optimal transport to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompt discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Consequently, the similarity can be calculated as a hierarchical transportation problem between the modality-specific sets. Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. The qualitative analysis demonstrates that the learned prompt tokens can capture diverse visual concepts.
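To make the token-level alignment idea concrete, the sketch below computes an entropic optimal-transport (Sinkhorn) distance between one set of textual prompt tokens and one set of visual tokens, with uniform mass on each token. This is a minimal illustration under assumed settings, not the authors' implementation: the names sinkhorn, prompt_tokens, and visual_tokens, as well as the hyperparameters eps and n_iters, are hypothetical, and the paper's hierarchical transport over multiple prompt modes is not reproduced here.

import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=100):
    # Entropic-regularized OT between two uniform discrete
    # distributions, given a cost matrix of shape (m, n).
    # Returns the transport plan T and the OT cost <T, cost>.
    m, n = cost.shape
    mu = np.full(m, 1.0 / m)   # uniform mass over prompt tokens
    nu = np.full(n, 1.0 / n)   # uniform mass over visual tokens
    K = np.exp(-cost / eps)    # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iters):   # alternating Sinkhorn scaling updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    T = u[:, None] * K * v[None, :]
    return T, float(np.sum(T * cost))

# Toy data: hypothetical token embeddings for one textual prompt
# (m tokens) and one image (n patch tokens), both d-dimensional.
rng = np.random.default_rng(0)
prompt_tokens = rng.standard_normal((4, 8))   # m=4, d=8
visual_tokens = rng.standard_normal((6, 8))   # n=6, d=8

# Cost = 1 - cosine similarity between every token pair.
a = prompt_tokens / np.linalg.norm(prompt_tokens, axis=1, keepdims=True)
b = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
cost = 1.0 - a @ b.T

plan, distance = sinkhorn(cost)
print(plan.shape)   # (4, 6) soft matching between token sets
print(distance)     # token-level transport cost (lower = better aligned)

In the multi-mode setting described in the abstract, one such token-level cost would be computed per prompt-image pair, and a second, outer transport problem over the sets of prompts would then aggregate these costs into a single similarity score.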

Cite

Text

Wang et al. "Tuning Multi-Mode Token-Level Prompt Alignment Across Modalities." Neural Information Processing Systems, 2023.

Markdown

[Wang et al. "Tuning Multi-Mode Token-Level Prompt Alignment Across Modalities." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/wang2023neurips-tuning/)

BibTeX

@inproceedings{wang2023neurips-tuning,
  title     = {{Tuning Multi-Mode Token-Level Prompt Alignment Across Modalities}},
  author    = {Wang, Dongsheng and Li, Miaoge and Liu, Xinyang and Xu, MingSheng and Chen, Bo and Zhang, Hanwang},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/wang2023neurips-tuning/}
}