An Efficient Post-Hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval
Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. The mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive CIR training triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation in these projection-based CIR methods: a task discrepancy of text encoders between the original pre-training task of the encoders (text ↔ image) and the target CIR task (image + text ↔ image), which can negatively impact CIR performance. To reduce this discrepancy, a naive solution would be to train both image and text encoders with CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. We devise a novel frozen-target text contrastive learning designed to enhance the capability of the text encoder for CIR. We also propose two key enhancements: (1) a hard negative-based refined batch sampling strategy and (2) a refined concatenation scheme to further mitigate the training-inference discrepancy. Integrating RTD into state-of-the-art projection-based methods achieves performance comparable to, or even surpassing, resource-intensive state-of-the-art synthetic CIR triplet-based approaches, with only 23 minutes of additional training on 4 A100 GPUs. Our code is available at https://github.com/jaeseokbyun/RTD.
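The abstract does not spell out the frozen-target text contrastive objective; as a rough illustration only, an InfoNCE-style loss in which each trainable text embedding is pulled toward a frozen target embedding (and pushed away from the other targets in the batch) can be sketched as follows. The function name, NumPy implementation, and temperature value are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def frozen_target_contrastive_loss(student_emb, frozen_emb, temperature=0.07):
    """InfoNCE-style loss with frozen targets.

    student_emb: (B, D) trainable text-encoder outputs.
    frozen_emb:  (B, D) fixed target embeddings (no gradient flows here).
    Positives sit on the diagonal of the batch similarity matrix.
    """
    # L2-normalize rows so the dot products are cosine similarities.
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = frozen_emb / np.linalg.norm(frozen_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on diagonal
```

When the student embeddings align with their frozen targets, the loss approaches zero; mismatched pairings drive it up, which is what trains the text encoder toward the frozen reference space.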
Cite
Text
Byun et al. "An Efficient Post-Hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval." International Conference on Computer Vision, 2025.
Markdown
[Byun et al. "An Efficient Post-Hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/byun2025iccv-efficient/)
BibTeX
@inproceedings{byun2025iccv-efficient,
title = {{An Efficient Post-Hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval}},
author = {Byun, Jaeseok and Jeong, Seokhyeon and Kim, Wonjae and Chun, Sanghyuk and Moon, Taesup},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {3895--3904},
url = {https://mlanthology.org/iccv/2025/byun2025iccv-efficient/}
}