Zero-Shot Improvement of Object Counting with CLIP

Abstract

We focus on the object counting limitations of vision-language models, with a particular emphasis on Contrastive Language-Image Pre-Training (CLIP) models. We assess the counting performance of CLIP on a custom dataset, and this evaluation reveals significant variation in accuracy across object categories. To address this, we introduce a zero-shot, training-free method that improves counting accuracy by manipulating CLIP's text embedding space. Through comprehensive experiments, we demonstrate that our method not only enhances the counting capabilities of CLIP but also boosts the performance of text-to-image generative models such as Stable Diffusion, particularly in generating images with the correct number of objects.
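To make the counting setup concrete, below is a minimal sketch of the standard zero-shot baseline the paper evaluates: CLIP scores an image against count-bearing text prompts, and the highest-scoring prompt gives the predicted count. This is not the authors' embedding-manipulation method, only the baseline it builds on; the checkpoint, the prompt template, and the file `cats.jpg` are illustrative assumptions using the Hugging Face `transformers` CLIP API.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; any CLIP variant works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One prompt per candidate count; counting is cast as zero-shot classification.
counts = ["one", "two", "three", "four", "five"]
prompts = [f"a photo of {c} cats" for c in counts]

image = Image.open("cats.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
print("Predicted count:", counts[probs.argmax().item()])
```

This is the usual zero-shot classification recipe with count words substituted for class names; the paper's contribution is to modify the resulting text embeddings, without any training, so that these count-conditioned scores become more reliable.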

Cite

Text

Zhang et al. "Zero-Shot Improvement of Object Counting with CLIP." NeurIPS 2023 Workshops: R0-FoMo, 2023.

Markdown

[Zhang et al. "Zero-Shot Improvement of Object Counting with CLIP." NeurIPS 2023 Workshops: R0-FoMo, 2023.](https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-zeroshot/)

BibTeX

@inproceedings{zhang2023neuripsw-zeroshot,
  title     = {{Zero-Shot Improvement of Object Counting with CLIP}},
  author    = {Zhang, Ruisu and Chen, Yicong and Lee, Kangwook},
  booktitle = {NeurIPS 2023 Workshops: R0-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-zeroshot/}
}