Reason-Before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Tang, Yuanmin; Zhang, Jue; Qin, Xiaoting; Yu, Jing; Gou, Gaopeng; Xiong, Gang; Lin, Qingwei; Rajmohan, Saravan; Zhang, Dongmei; Wu, Qi

doi:10.1109/CVPR52734.2025.01343

CVPR 2025, pp. 14400-14410

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more accurately. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models to reason about a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning (OSrCIR) for ZS-CIR, which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code is available at https://github.com/microsoft/ACV/tree/main/OSrCIR.

Cite

Text

Tang et al. "Reason-Before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01343

Markdown

[Tang et al. "Reason-Before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/tang2025cvpr-reasonbeforeretrieve/) doi:10.1109/CVPR52734.2025.01343

BibTeX

@inproceedings{tang2025cvpr-reasonbeforeretrieve,
  title     = {{Reason-Before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval}},
  author    = {Tang, Yuanmin and Zhang, Jue and Qin, Xiaoting and Yu, Jing and Gou, Gaopeng and Xiong, Gang and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and Wu, Qi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14400--14410},
  doi       = {10.1109/CVPR52734.2025.01343},
  url       = {https://mlanthology.org/cvpr/2025/tang2025cvpr-reasonbeforeretrieve/}
}