Free Lunch Enhancements for Multi-Modal Crowd Counting

Abstract

This paper addresses multi-modal crowd counting with a novel `free lunch' training enhancement strategy that requires no additional data or parameters and adds no inference-time complexity. First, we introduce a cross-modal alignment technique as a plug-in post-processing step for the pre-trained backbone network, enhancing the model's ability to capture information shared across modalities. Second, we incorporate a regional density supervision mechanism during the fine-tuning stage, which differentiates features in regions of varying crowd density. Extensive experiments on three multi-modal crowd counting datasets validate our approach, which is the first to achieve an MAE below 10 on RGBT-CC. The code is available at https://github.com/HenryCilence/Free-Lunch-Multimodal-Counting.
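The abstract does not specify the alignment objective; the official repository linked above has the actual implementation. Purely as a generic illustration of what "cross-modal alignment" between paired feature maps can look like (not the paper's method), the sketch below scores per-location cosine agreement between hypothetical RGB and thermal backbone features:

```python
import numpy as np

def alignment_loss(rgb_feat, thermal_feat, eps=1e-8):
    """Toy cross-modal alignment objective (illustrative only, not the
    paper's loss): penalize low cosine similarity between per-location
    RGB and thermal feature vectors of shape (C, H, W)."""
    # Flatten spatial dimensions: (C, H, W) -> (H*W, C)
    r = rgb_feat.reshape(rgb_feat.shape[0], -1).T
    t = thermal_feat.reshape(thermal_feat.shape[0], -1).T
    # L2-normalize each spatial location's feature vector
    r = r / (np.linalg.norm(r, axis=1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    cos = np.sum(r * t, axis=1)       # per-location cosine similarity
    return float(np.mean(1.0 - cos))  # 0 when the modalities agree

rng = np.random.default_rng(0)
f = rng.random((8, 4, 4))
print(alignment_loss(f, f))                       # near 0 for identical features
print(alignment_loss(f, rng.random((8, 4, 4))))   # positive for mismatched features
```

Identical inputs yield a loss near zero, while independent features yield a positive loss, matching the intuition that alignment pulls shared information across modalities together.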

Cite

Text

Meng et al. "Free Lunch Enhancements for Multi-Modal Crowd Counting." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01308

Markdown

[Meng et al. "Free Lunch Enhancements for Multi-Modal Crowd Counting." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/meng2025cvpr-free/) doi:10.1109/CVPR52734.2025.01308

BibTeX

@inproceedings{meng2025cvpr-free,
  title     = {{Free Lunch Enhancements for Multi-Modal Crowd Counting}},
  author    = {Meng, Haoliang and Hong, Xiaopeng and Lai, Zhengqin and Shang, Miao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14013--14023},
  doi       = {10.1109/CVPR52734.2025.01308},
  url       = {https://mlanthology.org/cvpr/2025/meng2025cvpr-free/}
}