VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Kang, Seunggu; Moon, WonJun; Kim, Euiyeon; Heo, Jae-Pil

doi:10.1609/AAAI.V38I3.28050

AAAI 2024 pp. 2714-2722

Abstract

Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, because the two stages run sequentially, the pipeline remains vulnerable to error propagation. In this work, we propose a one-stage baseline, Visual-Language Baseline (VLBase), which explores the implicit association of the semantic-patch embeddings of CLIP. Subsequently, we extend VLBase to Visual-Language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, we introduce Semantic-conditioned Prompt Tuning (SPT) within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to translate the semantic-patch similarity map into a form appropriate for the counting task. Lastly, we transfer the layer-wise encoded features to the decoder through Segment-aware Skip Connection (SaSC) to keep the generalization capability for unseen classes. Through extensive experiments on FSC147, CARPK, and PUCPR+, we demonstrate the benefits of our end-to-end framework, VLCounter. Code is available at https://github.com/seunggu0305/VLCounter
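
To make the "semantic-patch similarity map" and the affine step in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes random vectors standing in for CLIP patch and text embeddings, a 14x14 patch grid, a 512-dimensional embedding, and an affine step that is a simple element-wise scale and shift; the function names `similarity_map` and `affine` and all of these values are hypothetical choices for illustration only.

```python
import numpy as np


def similarity_map(patch_emb, text_emb, grid_hw):
    """Cosine similarity between every patch embedding and one text embedding.

    patch_emb: array of shape (num_patches, dim)
    text_emb:  array of shape (dim,)
    grid_hw:   (height, width) of the patch grid, height * width == num_patches
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = p @ t  # shape (num_patches,), values in [-1, 1]
    return sim.reshape(grid_hw)


def affine(sim, scale, shift):
    """Element-wise affine map of the similarity map.

    Assumption of this sketch: scale and shift are plain numbers or arrays
    broadcastable to sim. In a trained model they would be learned parameters.
    """
    return scale * sim + shift


# Stand-in data (assumption): random embeddings instead of real CLIP outputs.
rng = np.random.default_rng(0)
grid_hw = (14, 14)
dim = 512
patches = rng.normal(size=(grid_hw[0] * grid_hw[1], dim))
text = rng.normal(size=(dim,))

sim = similarity_map(patches, text, grid_hw)
adjusted = affine(sim, scale=2.0, shift=0.5)
```

In this sketch the similarity map only indicates which patches resemble the text query; a decoder (not shown) would still be needed to turn such a map into a density map whose sum is the predicted count.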

Cite

Text

Kang et al. "VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I3.28050

Markdown

[Kang et al. "VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/kang2024aaai-vlcounter/) doi:10.1609/AAAI.V38I3.28050

BibTeX

@inproceedings{kang2024aaai-vlcounter,
  title     = {{VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting}},
  author    = {Kang, Seunggu and Moon, WonJun and Kim, Euiyeon and Heo, Jae-Pil},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {2714--2722},
  doi       = {10.1609/AAAI.V38I3.28050},
  url       = {https://mlanthology.org/aaai/2024/kang2024aaai-vlcounter/}
}