FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback

Sonam Goenka, Zhaoheng Zheng, Ayush Jaiswal, Rakesh Chada, Yue Wu, Varsha Hedau, Pradeep Natarajan

CVPR 2022 pp. 14105-14115

doi:10.1109/CVPR52688.2022.01371

Abstract

Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from visual and textual modalities simultaneously. We propose a new vision-language transformer-based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
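To make the idea of attention-based fusion of target image features concrete, the following is a minimal sketch and not the authors' implementation. It assumes each target image is described by a few feature vectors from different levels of context, fuses them with softmax attention weights computed against a single scoring vector (learned in a real model, fixed here), and ranks targets by cosine similarity to a query embedding. All function names, the scoring vector, and the toy dimensions are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, scoring_vector):
    """Fuse equal-length feature vectors into one vector.

    Each feature vector gets a scaled dot-product score against
    `scoring_vector` (a stand-in for a learned parameter), the scores
    are normalised with softmax, and the output is the weighted sum.
    """
    dim = len(scoring_vector)
    scores = [
        sum(f[i] * scoring_vector[i] for i in range(dim)) / math.sqrt(dim)
        for f in features
    ]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_targets(query_embedding, target_embeddings):
    """Return target indices sorted from most to least similar to the query."""
    sims = [cosine(query_embedding, t) for t in target_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
```

In this sketch the target side uses no text input and no transformer layers, which mirrors the asymmetry the abstract describes; how the query embedding is produced is left out entirely.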


Cite

Text

Goenka et al. "FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01371

Markdown

[Goenka et al. "FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/goenka2022cvpr-fashionvlp/) doi:10.1109/CVPR52688.2022.01371

BibTeX

@inproceedings{goenka2022cvpr-fashionvlp,
  title     = {{FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback}},
  author    = {Goenka, Sonam and Zheng, Zhaoheng and Jaiswal, Ayush and Chada, Rakesh and Wu, Yue and Hedau, Varsha and Natarajan, Pradeep},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {14105--14115},
  doi       = {10.1109/CVPR52688.2022.01371},
  url       = {https://mlanthology.org/cvpr/2022/goenka2022cvpr-fashionvlp/}
}