CountFormer: Multi-View Crowd Counting Transformer

Abstract

Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in scenes with heavy occlusion and severe perspective distortion. However, conventional MVC methods rely on hand-crafted heuristic features and require identical camera layouts, which limits their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called CountFormer that lifts multi-view image-level features to a scene-level volume representation and estimates the 3D density map from the volume features. Through a camera encoding strategy, CountFormer embeds camera parameters into the volume query and the image-level features, enabling it to handle camera layouts that differ significantly. Furthermore, we introduce a feature lifting module that capitalizes on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. Subsequently, the multi-view volume aggregation module attentively aggregates the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment compared to conventional MVC frameworks.
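The attentive aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax` and `aggregate_views`, the dot-product scoring, and the nested-list data layout are all assumptions chosen for illustration. It only shows why attention-weighted fusion over per-view volumes is independent of the number of cameras.

```python
import math
from typing import List, Sequence

Volume = List[List[float]]  # hypothetical layout: [num_voxels][channels]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_volumes: Sequence[Volume], query: Volume) -> Volume:
    """Fuse per-view volumes into one scene-level volume.

    Assumption: each voxel of the query attends over the matching voxel
    in every view using scaled dot-product scores. The number of views
    is not fixed, so the same code handles any camera count.
    """
    channels = len(query[0])
    scale = math.sqrt(channels)
    fused: Volume = []
    for v, q in enumerate(query):
        # One attention score per view for this voxel.
        scores = [
            sum(qc * kc for qc, kc in zip(q, vol[v])) / scale
            for vol in view_volumes
        ]
        weights = softmax(scores)
        # Convex combination of the views' features at this voxel.
        fused.append([
            sum(w * vol[v][c] for w, vol in zip(weights, view_volumes))
            for c in range(channels)
        ])
    return fused
```

Because the weights at each voxel sum to one, the fused feature is a convex combination of the per-view features, and adding or removing a camera changes only the length of the list being normalized.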

Cite

Text

Mo et al. "CountFormer: Multi-View Crowd Counting Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72943-0_2

Markdown

[Mo et al. "CountFormer: Multi-View Crowd Counting Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/mo2024eccv-countformer/) doi:10.1007/978-3-031-72943-0_2

BibTeX

@inproceedings{mo2024eccv-countformer,
  title     = {{CountFormer: Multi-View Crowd Counting Transformer}},
  author    = {Mo, Hong and Zhang, Xiong and Tan, Jianchao and Yang, Cheng and Gu, Qiong and Hang, Bo and Ren, Wenqi},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72943-0_2},
  url       = {https://mlanthology.org/eccv/2024/mo2024eccv-countformer/}
}