T2ICount: Enhancing Cross-Modal Understanding for Zero-Shot Counting

Abstract

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models such as CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages the rich prior knowledge and fine-grained visual understanding of pretrained diffusion models. While one-step denoising ensures efficiency, it weakens text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To remedy this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.
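
The cross-attention supervision idea in the abstract can be illustrated with a short sketch. The following is a minimal, self-contained example, not the authors' implementation (all function and variable names are hypothetical), of how a cross-attention map between text-token embeddings and flattened image patch features from a denoising U-Net block could be pooled into a per-location relevance map of the kind such a loss might use as a supervision signal:

# Minimal sketch (not the paper's code): pooling a text-to-image
# cross-attention map into a per-location relevance map.
import torch

def cross_attention_map(img_feats, txt_feats, dim_head):
    """img_feats: (B, N, C) flattened patch features from a U-Net block.
    txt_feats: (B, T, C) text-encoder token embeddings.
    Returns a (B, N) relevance map averaged over text tokens."""
    # In a real U-Net, queries and keys pass through learned projections;
    # here raw features stand in for them.
    attn = torch.softmax(
        img_feats @ txt_feats.transpose(-2, -1) / dim_head ** 0.5, dim=-1
    )  # (B, N, T): attention of each image location over text tokens
    return attn.mean(dim=-1)  # average token relevance per location

B, N, T, C = 1, 32 * 32, 8, 64
img = torch.randn(B, N, C)
txt = torch.randn(B, T, C)
relevance = cross_attention_map(img, txt, dim_head=C)
print(relevance.shape)  # torch.Size([1, 1024]); reshape to (32, 32) for a spatial map

In the actual framework, the attention maps come from the denoising U-Net's own cross-attention layers; the sketch only shows how such a map reduces to a spatial supervision signal.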

Cite

Text

Qian et al. "T2ICount: Enhancing Cross-Modal Understanding for Zero-Shot Counting." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02359

Markdown

[Qian et al. "T2ICount: Enhancing Cross-Modal Understanding for Zero-Shot Counting." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/qian2025cvpr-t2icount/) doi:10.1109/CVPR52734.2025.02359

BibTeX

@inproceedings{qian2025cvpr-t2icount,
  title     = {{T2ICount: Enhancing Cross-Modal Understanding for Zero-Shot Counting}},
  author    = {Qian, Yifei and Guo, Zhongliang and Deng, Bowen and Lei, Chun Tong and Zhao, Shuai and Lau, Chun Pong and Hong, Xiaopeng and Pound, Michael P.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {25336--25345},
  doi       = {10.1109/CVPR52734.2025.02359},
  url       = {https://mlanthology.org/cvpr/2025/qian2025cvpr-t2icount/}
}