ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image that satisfies a multimodal query consisting of a reference image and a modification text provided by the user. Despite their notable success, existing approaches fail to adequately model the modification relation between visual entities and modification actions. Modeling this relation is non-trivial due to three challenges: 1) irrelevant factor perturbation, 2) vague semantic boundaries, and 3) implicit modification relations. To address these challenges, we propose an Entity miNing and modifiCation relatiOn binDing nEtwoRk (ENCODER), which mines visual entities and modification actions and then binds their modification relations. First, we design the Latent Factor Filter (LFF) module, which uses a threshold gating mechanism to filter the visual and textual latent factors related to modification semantics. Second, we propose Entity-Action Binding (EAB), which comprises modality-shared Learnable Relation Queries (LRQ) that mine visual entities and modification actions and learn implicit modification relations for entity-action binding. Finally, we introduce the Multi-scale Composition module, which performs multi-scale feature composition guided by entity-action binding. Extensive experiments on four benchmark datasets demonstrate the superiority of the proposed method.
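
The abstract does not specify how the threshold gating in LFF is computed, so the following is only an illustrative sketch of threshold gating in general, not the authors' module. It assumes each latent factor is a feature vector, that relevance is scored by a sigmoid over a dot product with a scoring vector, and that factors scoring at or below a threshold are zeroed out. The names `threshold_gate`, `weights`, and `tau` are hypothetical.

```python
import math


def threshold_gate(factors, weights, tau=0.5):
    """Zero out latent factors whose relevance score does not exceed tau.

    factors: list of feature vectors (hypothetical latent factors).
    weights: scoring vector (hypothetical; it would be learned in a real model).
    tau:     gating threshold in (0, 1) (assumed value, not from the paper).
    Returns (gated_factors, keep_mask).
    """
    # Relevance score in (0, 1): sigmoid of the dot product with the scoring vector.
    scores = [
        1.0 / (1.0 + math.exp(-sum(f * w for f, w in zip(vec, weights))))
        for vec in factors
    ]
    # Hard gate: a factor passes only if its score is strictly above the threshold.
    keep = [score > tau for score in scores]
    # Factors that fail the gate are replaced with zero vectors of the same length.
    gated = [
        list(vec) if passed else [0.0] * len(vec)
        for vec, passed in zip(factors, keep)
    ]
    return gated, keep


if __name__ == "__main__":
    toy_factors = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.5]]
    gated, keep = threshold_gate(toy_factors, weights=[1.0, 0.0], tau=0.5)
    print(keep)
    print(gated)
```

A hard gate like this one is not differentiable with respect to the threshold. A trainable model would typically need a soft or straight-through variant, and the abstract does not say which the authors use.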

Cite

Text

Li et al. "ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I5.32541

Markdown

[Li et al. "ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/li2025aaai-encoder/) doi:10.1609/AAAI.V39I5.32541

BibTeX

@inproceedings{li2025aaai-encoder,
  title     = {{ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval}},
  author    = {Li, Zixu and Chen, Zhiwei and Wen, Haokun and Fu, Zhiheng and Hu, Yupeng and Guan, Weili},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {5101-5109},
  doi       = {10.1609/AAAI.V39I5.32541},
  url       = {https://mlanthology.org/aaai/2025/li2025aaai-encoder/}
}