Adaptive Prompt-Based Semantic Embedding with Inspire Potential of Implicit Knowledge for Cross-Modal Retrieval

Huang, Xin; Wang, Shilong; Jia, Tong; Gou, Zhihang; Li, Jingjing

doi:10.1609/AAAI.V39I16.33922

AAAI 2025 pp. 17485-17493

Abstract

In the era of big data, cross-modal retrieval is increasingly important in both research and applications. Because cross-modal relationships are latently complex and non-intuitive, leveraging external knowledge, such as large models, has become a popular way to facilitate modality alignment. Existing methods typically fine-tune the model encoders or use a fixed number of prompts. However, these approaches struggle with the significant information asymmetry between image-text pairs and the high distribution diversity of image data. These limitations not only introduce noise during training but also reduce accuracy and generalization in cross-modal retrieval tasks. To address these issues, this paper proposes Adaptive Prompt-Based Semantic Embedding with Inspired Potential of Implicit Knowledge (APSE-IPIK). First, we propose an inspired potential strategy that extracts fine-grained, multi-perspective text descriptions from large-scale pre-trained multimodal models, which can be seen as implicit knowledge injection. These descriptions are integrated into the visual-semantic embedding through cross-modal semantic alignment with images, balancing the information asymmetry between modalities and reducing the embedding of inaccurate mapping relationships. Second, we construct an instance-level, query-based prompt pool strategy that adaptively extracts the most relevant prompts, addressing alignment biases caused by intra-modal (especially image) data diversity and improving alignment accuracy. Extensive experiments on two widely used datasets, Flickr30k and MSCOCO, demonstrate the effectiveness of the proposed method.
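
The abstract does not say how the query-based prompt pool selects prompts, so the following is only a minimal sketch of one common way such a pool can work, not the authors' implementation. It assumes each prompt is paired with a key vector, that an instance's feature vector serves as the query, and that the prompts whose keys have the highest cosine similarity to the query are selected. The names `cosine`, `select_prompts`, `keys`, `prompts`, and `top_k` are hypothetical, and the string prompts stand in for what would normally be learnable prompt embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_prompts(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    prompts: Sequence[str],
    top_k: int = 2,
) -> List[Tuple[int, str]]:
    """Return (index, prompt) pairs for the top_k keys most similar to the query.

    Assumption: selection is by cosine similarity between the instance-level
    query and per-prompt keys; the paper may use a different matching rule.
    """
    if len(keys) != len(prompts):
        raise ValueError("keys and prompts must have the same length")
    ranked = sorted(
        range(len(keys)),
        key=lambda i: cosine(query, keys[i]),
        reverse=True,
    )
    return [(i, prompts[i]) for i in ranked[:top_k]]


# Toy pool: three prompts, each with a 2-D key (illustrative values only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = ["prompt_a", "prompt_b", "prompt_c"]

# A query close to the first key picks prompt_a first, then prompt_c.
query = [0.9, 0.1]
selected = select_prompts(query, keys, prompts, top_k=2)
print(selected)
```

Because the query is computed per instance, two images with different feature vectors can receive different prompts from the same pool, which is the adaptive behavior the abstract contrasts with a fixed number of prompts.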

Cite

Text

Huang et al. "Adaptive Prompt-Based Semantic Embedding with Inspire Potential of Implicit Knowledge for Cross-Modal Retrieval." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I16.33922

Markdown

[Huang et al. "Adaptive Prompt-Based Semantic Embedding with Inspire Potential of Implicit Knowledge for Cross-Modal Retrieval." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/huang2025aaai-adaptive/) doi:10.1609/AAAI.V39I16.33922

BibTeX

@inproceedings{huang2025aaai-adaptive,
  title     = {{Adaptive Prompt-Based Semantic Embedding with Inspire Potential of Implicit Knowledge for Cross-Modal Retrieval}},
  author    = {Huang, Xin and Wang, Shilong and Jia, Tong and Gou, Zhihang and Li, Jingjing},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {17485--17493},
  doi       = {10.1609/AAAI.V39I16.33922},
  url       = {https://mlanthology.org/aaai/2025/huang2025aaai-adaptive/}
}