Improving CLIP Counting Accuracy via Parameter-Efficient Fine-Tuning
Abstract
We focus on addressing the object counting limitations of vision-language models, with a particular emphasis on Contrastive Language-Image Pre-training (CLIP) models. Centered on our hypothesis that counting knowledge can be abstracted into linear vectors within the text embedding space, we develop a parameter-efficient fine-tuning method and several zero-shot methods to improve CLIP's counting accuracy. Through comprehensive experiments, we demonstrate that our learning-based method not only outperforms full-model fine-tuning in counting accuracy but also retains the broad capabilities of pre-trained CLIP models. Our zero-shot text embedding editing techniques are also effective in situations where training data is scarce, and can be extended to improve Stable Diffusion's ability to generate images with precise object counts. We also contribute two specialized datasets to train and evaluate CLIP’s counting capabilities. Our code is available at https://github.com/UW-Madison-Lee-Lab/CLIP_Counting.
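To make the core idea concrete, below is a minimal, hypothetical sketch of zero-shot text embedding editing under the "linear counting vector" hypothesis described in the abstract. It is not the authors' exact method: the model checkpoint, prompt pairs, and the simple averaging of embedding differences are illustrative assumptions, using the Hugging Face CLIP API.

```python
# Hypothetical sketch: estimate a "counting direction" in CLIP's text embedding
# space from prompt pairs that differ only in object count, then shift a new
# prompt's embedding along that direction (zero-shot, no fine-tuning).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(prompts):
    # Encode prompts and L2-normalize the resulting text embeddings.
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# Illustrative prompt pairs differing only in the stated count ("two" -> "four").
source = embed_text(["a photo of two dogs", "a photo of two apples", "a photo of two cars"])
target = embed_text(["a photo of four dogs", "a photo of four apples", "a photo of four cars"])

# Average the embedding differences to get a single linear counting vector.
count_direction = (target - source).mean(dim=0)

# Zero-shot edit: push a new prompt's embedding along the counting direction.
edited = embed_text(["a photo of two birds"]) + count_direction
edited = edited / edited.norm(dim=-1, keepdim=True)
# `edited` can then replace the original text embedding when scoring images
# that should contain four birds.
```

In this sketch the edit is purely a vector addition in embedding space, which is what makes it usable without any training data; the paper's learning-based variant instead fine-tunes a small number of parameters to the same end.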
Cite
Text
Zhang et al. "Improving CLIP Counting Accuracy via Parameter-Efficient Fine-Tuning." Transactions on Machine Learning Research, 2025.
Markdown
[Zhang et al. "Improving CLIP Counting Accuracy via Parameter-Efficient Fine-Tuning." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/zhang2025tmlr-improving/)
BibTeX
@article{zhang2025tmlr-improving,
title = {{Improving CLIP Counting Accuracy via Parameter-Efficient Fine-Tuning}},
author = {Zhang, Ruisu and Chen, Yicong and Lee, Kangwook},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/zhang2025tmlr-improving/}
}