CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback

Abstract

We tackle the task of image retrieval with text feedback, where a reference image and modifier text are combined to identify the desired target image. We focus on designing an image-text compositor, i.e., integrating multi-modal inputs to produce a representation similar to that of the target image. In our algorithm, Content-Style Modulation (CoSMo), we approach this challenge by introducing two modules based on deep neural networks: the content and style modulators. The content modulator performs local updates to the reference image feature after normalizing the style of the image, where a disentangled multi-modal non-local block is employed to achieve the desired content modifications. Then, the style modulator reintroduces global style information to the updated feature. We provide an in-depth view of our algorithm and its design choices, and show that it accomplishes outstanding performance on multiple image-text retrieval benchmarks. Our code can be found at: https://github.com/postBG/CosMo.pytorch
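The "normalize style, update content, reintroduce style" flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes, as is common in style-transfer work, that "style" means per-channel mean and standard deviation over spatial positions. The function names, the identity content update, and the scale and shift values are all hypothetical. The abstract's disentangled multi-modal non-local block is not modeled here.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return mean and (eps-stabilized) standard deviation of one channel."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return mean, math.sqrt(var + eps)


def normalize_style(feature):
    """Remove per-channel statistics (assumed stand-in for style normalization)."""
    out = []
    for channel in feature:
        mean, std = channel_stats(channel)
        out.append([(v - mean) / std for v in channel])
    return out


def modulate_style(feature, scales, shifts):
    """Reapply a per-channel scale and shift (hypothetical style parameters)."""
    return [
        [s * v + b for v in channel]
        for channel, s, b in zip(feature, scales, shifts)
    ]


# Toy feature map: 2 channels, each with 4 flattened spatial positions.
reference = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]]

# Step 1: strip the style statistics from the reference feature.
normalized = normalize_style(reference)

# Step 2: content update. Left as the identity here; this is a placeholder
# for whatever text-conditioned local update a real model would apply.
updated = normalized

# Step 3: reintroduce global style through made-up per-channel scale and shift.
restyled = modulate_style(updated, scales=[2.0, 0.5], shifts=[1.0, -1.0])
```

In a trained model, the scale and shift would presumably be predicted from the inputs rather than fixed by hand. The sketch only shows why the two steps compose cleanly: after normalization each channel has roughly zero mean and unit variance, so the reapplied scale and shift directly set the new channel statistics.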

Cite

Text

Lee et al. "CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00086

Markdown

[Lee et al. "CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/lee2021cvpr-cosmo/) doi:10.1109/CVPR46437.2021.00086

BibTeX

@inproceedings{lee2021cvpr-cosmo,
  title     = {{CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback}},
  author    = {Lee, Seungmin and Kim, Dongwan and Han, Bohyung},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {802-812},
  doi       = {10.1109/CVPR46437.2021.00086},
  url       = {https://mlanthology.org/cvpr/2021/lee2021cvpr-cosmo/}
}