Parrot Captions Teach CLIP to Spot Text

Lin, Yiqi; He, Conghui; Wang, Alex Jinpeng; Wang, Bin; Li, Weijia; Shou, Mike Zheng

doi:10.1007/978-3-031-72946-1_21

ECCV 2024


Abstract

Despite CLIP being the foundation model in numerous vision-language applications, it suffers from a severe text spotting bias. This bias causes CLIP models to ‘Parrot’ the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content and around 30% of caption words are concurrently embedded in the visual content. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is a dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models across various vision-language downstream tasks. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering. Project page: https://linyq17.github.io/CLIP-Parrot-Bias/
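As a rough illustration of the caption statistic mentioned in the abstract (the share of caption words that also appear as text in the image), the following is a minimal sketch only. The function names `tokenize` and `co_occurrence_rate`, the simple lowercase alphanumeric tokenizer, and the assumption that the spotted text is already available as a string from some text-spotting step are all hypothetical; the paper's actual measurement procedure may differ.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def co_occurrence_rate(caption, spotted_text):
    """Fraction of caption words that also occur in the text spotted in the image.

    `spotted_text` is assumed to come from an external text spotter (OCR);
    this sketch does not perform any detection itself.
    """
    caption_words = tokenize(caption)
    if not caption_words:
        return 0.0  # no caption words, so nothing can be parroted
    spotted = set(tokenize(spotted_text))
    hits = sum(1 for word in caption_words if word in spotted)
    return hits / len(caption_words)


# Hypothetical example: five of the six caption words are printed in the image.
rate = co_occurrence_rate("Keep Calm and Carry On poster", "KEEP CALM AND CARRY ON")
```

Under these assumptions, a caption that mostly spells out the embedded text scores close to 1, while a caption describing only the visual content scores close to 0.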

Cite

Text

Lin et al. "Parrot Captions Teach CLIP to Spot Text." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72946-1_21

Markdown

[Lin et al. "Parrot Captions Teach CLIP to Spot Text." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/lin2024eccv-parrot/) doi:10.1007/978-3-031-72946-1_21

BibTeX

@inproceedings{lin2024eccv-parrot,
  title     = {{Parrot Captions Teach CLIP to Spot Text}},
  author    = {Lin, Yiqi and He, Conghui and Wang, Alex Jinpeng and Wang, Bin and Li, Weijia and Shou, Mike Zheng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72946-1_21},
  url       = {https://mlanthology.org/eccv/2024/lin2024eccv-parrot/}
}