Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain

Abstract

We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment-guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training on patches of different scales. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework; it attains new state-of-the-art results by large margins on four downstream tasks, including text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% abs imv.), category recognition (ACC: 3.28% abs imv.), and fashion captioning (Bleu4: 1.2 abs imv.). We validate the efficiency of Kaleido-BERT on a wide range of e-commercial websites, demonstrating its broader potential in real-world applications.
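
The kaleido strategy rests on multi-scale patching of each product image, with one self-supervised task (rotation, jigsaw, camouflage, grey-to-color, blank-to-color) assigned per scale. The sketch below is not the authors' code; it only illustrates how such multi-scale kaleido patches could be generated. The function name kaleido_patches, the patch_size value, and the resize-then-crop approach are illustrative assumptions, while the 1x1 through 5x5 grid scales follow the paper's description.

from PIL import Image

def kaleido_patches(image_path, scales=(1, 2, 3, 4, 5), patch_size=64):
    # Split the image into a k x k grid at each scale k; scales 1..5
    # yield 1 + 4 + 9 + 16 + 25 = 55 patches per image.
    img = Image.open(image_path).convert("RGB")
    patches = []
    for k in scales:
        grid = img.resize((patch_size * k, patch_size * k))
        for row in range(k):
            for col in range(k):
                box = (col * patch_size, row * patch_size,
                       (col + 1) * patch_size, (row + 1) * patch_size)
                patches.append((k, row, col, grid.crop(box)))
    return patches  # list of (scale, row, col, PIL patch) tuples

Under these assumptions, kaleido_patches("dress.jpg") returns 55 patch tuples that could then be embedded and masked according to the alignment-guided strategy described above.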

Cite

Text

Zhuge et al. "Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01246

Markdown

[Zhuge et al. "Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/zhuge2021cvpr-kaleidobert/) doi:10.1109/CVPR46437.2021.01246

BibTeX

@inproceedings{zhuge2021cvpr-kaleidobert,
  title     = {{Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain}},
  author    = {Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Jin, Linbo and Chen, Ben and Zhou, Haoming and Qiu, Minghui and Shao, Ling},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {12647--12657},
  doi       = {10.1109/CVPR46437.2021.01246},
  url       = {https://mlanthology.org/cvpr/2021/zhuge2021cvpr-kaleidobert/}
}