FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
Abstract
Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from visual and textual modalities simultaneously. We propose a new vision-language transformer-based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
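To make the asymmetric design concrete, here is a minimal, hypothetical PyTorch sketch. It is not the authors' implementation: the class and method names (AsymmetricRetrieval, encode_query, encode_target), the embedding dimension, layer counts, and the learnable pooling query are all assumptions. It only illustrates the shape of the idea stated in the abstract: query tokens (reference image plus feedback text) pass through transformer layers, while target images are fused by a lightweight attention step with no text and no transformer layers involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricRetrieval(nn.Module):
    """Hypothetical sketch of the asymmetric design described in the abstract:
    the query side (reference-image tokens + feedback-text tokens) runs through
    transformer layers, while target images are fused by attention pooling
    alone, with no text and no transformer layers on the target side."""

    def __init__(self, dim: int = 768, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.query_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Learnable query vector used to attention-pool target image features
        # (an assumption; the paper's exact fusion mechanism may differ).
        self.pool_query = nn.Parameter(torch.randn(1, 1, dim))
        self.target_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def encode_query(self, ref_img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate visual and textual tokens, encode jointly with the
        # transformer, and take the first token as the fused query embedding.
        tokens = torch.cat([ref_img_tokens, text_tokens], dim=1)
        fused = self.query_encoder(tokens)
        return F.normalize(fused[:, 0], dim=-1)

    def encode_target(self, tgt_img_tokens: torch.Tensor) -> torch.Tensor:
        # Attention-based fusion of target image features only; no text or
        # transformer layers touch the target path.
        q = self.pool_query.expand(tgt_img_tokens.size(0), -1, -1)
        pooled, _ = self.target_attn(q, tgt_img_tokens, tgt_img_tokens)
        return F.normalize(pooled[:, 0], dim=-1)

    def forward(self, ref_img_tokens, text_tokens, tgt_img_tokens) -> torch.Tensor:
        q = self.encode_query(ref_img_tokens, text_tokens)
        t = self.encode_target(tgt_img_tokens)
        return (q * t).sum(dim=-1)  # cosine similarity of unit-norm embeddings


if __name__ == "__main__":
    model = AsymmetricRetrieval()
    ref = torch.randn(2, 49, 768)  # reference image tokens (e.g., grid features)
    txt = torch.randn(2, 16, 768)  # natural language feedback tokens
    tgt = torch.randn(2, 49, 768)  # candidate target image tokens
    print(model(ref, txt, tgt).shape)  # torch.Size([2])
```

One practical consequence of this asymmetry: because encode_target depends only on the image, gallery embeddings can be precomputed and indexed offline, while only the query side needs a transformer pass at retrieval time.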
Cite
Text
Goenka et al. "FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01371Markdown
[Goenka et al. "FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/goenka2022cvpr-fashionvlp/) doi:10.1109/CVPR52688.2022.01371BibTeX
@inproceedings{goenka2022cvpr-fashionvlp,
title = {{FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback}},
author = {Goenka, Sonam and Zheng, Zhaoheng and Jaiswal, Ayush and Chada, Rakesh and Wu, Yue and Hedau, Varsha and Natarajan, Pradeep},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {14105--14115},
doi = {10.1109/CVPR52688.2022.01371},
url = {https://mlanthology.org/cvpr/2022/goenka2022cvpr-fashionvlp/}
}