Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features
Abstract
In this paper, we present an approach for conditioned and composed image retrieval based on CLIP features. In this extension of content-based image retrieval (CBIR), an image is combined with a text that describes the user's intentions; the task is relevant for application domains such as e-commerce. The proposed method is based on an initial training stage in which a simple combination of visual and textual features is used to fine-tune the CLIP text encoder. In a second training stage, we learn a more complex Combiner network that merges the visual and textual features. Contrastive learning is used in both stages. The proposed approach obtains state-of-the-art performance for conditioned CBIR on the FashionIQ dataset and for composed CBIR on the more recent CIRR dataset.
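The two-stage recipe in the abstract can be illustrated with a minimal sketch: combine an image feature and a text feature into a single query vector, then score it against candidate images with a contrastive (InfoNCE-style) objective so that each combined query matches its target image within a batch. This is only a toy illustration under assumed choices, not the paper's implementation: the stage-1 combination here is a normalized element-wise sum, the embedding dimension (512, as in CLIP ViT-B/32) and temperature are assumptions, and random vectors stand in for real CLIP encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit L2 norm, as done with CLIP embeddings."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def combine(img_feat, txt_feat):
    # Stage-1-style simple combination (assumed): normalized element-wise sum
    # of the reference-image feature and the modifying-text feature.
    return l2_normalize(img_feat + txt_feat)

def contrastive_loss(queries, targets, temperature=0.07):
    # InfoNCE-style batch loss: the i-th combined query should be closest
    # to the i-th target-image feature among all targets in the batch.
    logits = (queries @ targets.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: random stand-ins for CLIP image/text encoder outputs.
rng = np.random.default_rng(0)
d, n = 512, 8
img = l2_normalize(rng.normal(size=(n, d)))   # reference-image features
txt = l2_normalize(rng.normal(size=(n, d)))   # modifying-text features
tgt = l2_normalize(rng.normal(size=(n, d)))   # target-image features

queries = combine(img, txt)
loss = contrastive_loss(queries, tgt)
```

In the paper's second stage, a learned Combiner network would replace the fixed sum above; the contrastive objective over (combined query, target image) pairs stays the same in spirit.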
Cite
Text
Baldrati et al. "Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00543
Markdown
[Baldrati et al. "Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/baldrati2022cvprw-conditioned/) doi:10.1109/CVPRW56347.2022.00543
BibTeX
@inproceedings{baldrati2022cvprw-conditioned,
title = {{Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features}},
author = {Baldrati, Alberto and Bertini, Marco and Uricchio, Tiberio and Del Bimbo, Alberto},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2022},
pages = {4955-4964},
doi = {10.1109/CVPRW56347.2022.00543},
url = {https://mlanthology.org/cvprw/2022/baldrati2022cvprw-conditioned/}
}