ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion

Abstract

Traditional image-to-image and text-to-image search methods struggle to comprehend complex user intentions, particularly in fashion e-commerce, where users seek products similar to a reference image but modified according to an accompanying text query. This paper introduces Progressive Vision-Language Alignment and Multimodal Fusion (ProVLA), a novel approach that uses a transformer-based vision-language model to generate multimodal embeddings. Our method combines a two-step learning process, a cross-attention-based fusion encoder for robust information fusion, and a momentum-queue-based hard-negative mining mechanism that handles noisy training data. Extensive evaluations on the Fashion 200k and Shoes benchmark datasets demonstrate that our model outperforms state-of-the-art methods.
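
The abstract's cross-attention fusion encoder suggests a standard pattern: modification-text tokens attend to reference-image tokens, and the result is pooled into a single multimodal query embedding for retrieval. The PyTorch sketch below shows one plausible form under that reading; the class name, dimensions, residual/FFN layout, and mean-pooling readout are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of a cross-attention fusion encoder; not the authors' code.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses reference-image tokens with modification-text tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Text tokens act as queries; image tokens supply keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, D); image_tokens: (B, P, D)
        fused, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        fused = self.norm1(text_tokens + fused)      # residual + norm
        fused = self.norm2(fused + self.ffn(fused))  # feed-forward refinement
        # Pool token features into one multimodal embedding for retrieval.
        return fused.mean(dim=1)

# Toy usage: build one multimodal query from an image + text pair.
fusion = CrossAttentionFusion()
text = torch.randn(1, 12, 256)   # 12 text tokens
image = torch.randn(1, 49, 256)  # 7x7 grid of image patch tokens
query = fusion(text, image)      # (1, 256) query embedding
```

Likewise, a momentum queue for hard-negative mining typically keeps a FIFO buffer of momentum-encoder embeddings and selects the most similar entries as negatives for the contrastive loss. The following minimal sketch assumes that common setup; the queue size, top-k mining rule, and normalization choices are ours, not the authors'.

```python
# Hedged sketch of momentum-queue hard-negative mining; parameters are assumptions.
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO queue of momentum-encoder embeddings used to mine hard negatives."""

    def __init__(self, dim: int = 256, size: int = 4096):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys: torch.Tensor) -> None:
        # Overwrite the oldest entries with the newest momentum-encoder outputs.
        n = keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.queue.shape[0]
        self.queue[idx] = F.normalize(keys, dim=1)
        self.ptr = (self.ptr + n) % self.queue.shape[0]

    def hardest(self, queries: torch.Tensor, k: int = 16) -> torch.Tensor:
        # The highest-similarity queue entries serve as hard negatives.
        sims = F.normalize(queries, dim=1) @ self.queue.T  # (B, size)
        return sims.topk(k, dim=1).indices
```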

Cite

Text

Hu et al. "ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00293

Markdown

[Hu et al. "ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/hu2023iccvw-provla/) doi:10.1109/ICCVW60793.2023.00293

BibTeX

@inproceedings{hu2023iccvw-provla,
  title     = {{ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion}},
  author    = {Hu, Zhizhang and Zhu, Xinliang and Tran, Son and Vidal, René and Dhua, Arnab},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2023},
  pages     = {2764--2769},
  doi       = {10.1109/ICCVW60793.2023.00293},
  url       = {https://mlanthology.org/iccvw/2023/hu2023iccvw-provla/}
}