Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Abstract
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the JinaCLIP model and achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
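A minimal sketch of the kind of multi-task contrastive objective the abstract describes: one InfoNCE loss over text-image pairs and one over text-text pairs, summed into a single training loss. All function names, the weighting scheme, and the temperature value below are illustrative assumptions, not the authors' implementation.

# multi_task_contrastive_sketch.py -- illustrative only
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_task_loss(text_emb, image_emb, query_emb, passage_emb,
                    w_text_image: float = 1.0, w_text_text: float = 1.0) -> torch.Tensor:
    """Combine a text-image and a text-text contrastive loss into one objective."""
    return (w_text_image * info_nce(text_emb, image_emb) +
            w_text_text * info_nce(query_emb, passage_emb))

if __name__ == "__main__":
    batch, dim = 8, 512
    # Stand-ins for encoder outputs: caption/image pairs and query/passage pairs.
    loss = multi_task_loss(torch.randn(batch, dim), torch.randn(batch, dim),
                           torch.randn(batch, dim), torch.randn(batch, dim))
    print(loss.item())

Because both losses share the same text tower, a single model is pushed to produce embeddings that work for image retrieval and for text-only retrieval at once, which is the inefficiency the abstract targets.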
Cite
Text
Xiao et al. "Jina CLIP: Your CLIP Model Is Also Your Text Retriever." ICML 2024 Workshops: MFM-EAI, 2024.
Markdown
[Xiao et al. "Jina CLIP: Your CLIP Model Is Also Your Text Retriever." ICML 2024 Workshops: MFM-EAI, 2024.](https://mlanthology.org/icmlw/2024/xiao2024icmlw-jina/)
BibTeX
@inproceedings{xiao2024icmlw-jina,
title = {{Jina CLIP: Your CLIP Model Is Also Your Text Retriever}},
author = {Xiao, Han and Mastrapas, Georgios and Wang, Bo},
booktitle = {ICML 2024 Workshops: MFM-EAI},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/xiao2024icmlw-jina/}
}