Multi-Modal Crowd Counting via a Broker Modality

Abstract

Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach that introduces an auxiliary broker modality and, on this basis, frames the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters yet achieves promising results. The code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting.
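
The abstract describes the pipeline only at a high level. The sketch below is one minimal, hypothetical PyTorch rendering of the triple-modal idea: a lightweight fusion network produces a broker image from the RGB/thermal pair, and three encoder branches plus a regression head predict the density map. All module names (BrokerGenerator, TripleModalCounter), layer widths, and the concatenation-based fusion are illustrative assumptions, not the authors' architecture; see the linked repository for the actual implementation.

import torch
import torch.nn as nn

class BrokerGenerator(nn.Module):
    """Hypothetical lightweight fusion network: maps an RGB/thermal pair
    to a single 'broker' image (stand-in for the paper's non-diffusion
    counterpart of diffusion-based fusion models)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, rgb, thermal):
        # Concatenate the two modalities along channels and fuse.
        return self.net(torch.cat([rgb, thermal], dim=1))

class TripleModalCounter(nn.Module):
    """Toy triple-branch counter: one encoder per modality, features
    concatenated, then a 1x1 conv regresses the density map."""
    def __init__(self, feat=64):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            )
        self.enc_rgb = encoder()
        self.enc_thermal = encoder()
        self.enc_broker = encoder()
        self.head = nn.Conv2d(feat * 3, 1, 1)

    def forward(self, rgb, thermal, broker):
        f = torch.cat(
            [self.enc_rgb(rgb), self.enc_thermal(thermal), self.enc_broker(broker)],
            dim=1,
        )
        return self.head(f)  # density map; summing it gives the count

# Usage: generate the auxiliary broker modality, then count from all three.
broker_gen = BrokerGenerator()
counter = TripleModalCounter()
rgb = torch.rand(1, 3, 256, 256)
thermal = torch.rand(1, 3, 256, 256)
broker = broker_gen(rgb, thermal)
density = counter(rgb, thermal, broker)
print(f"estimated count: {density.sum().item():.1f}")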

Cite

Text

Meng et al. "Multi-Modal Crowd Counting via a Broker Modality." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72904-1_14

Markdown

[Meng et al. "Multi-Modal Crowd Counting via a Broker Modality." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/meng2024eccv-multimodal/) doi:10.1007/978-3-031-72904-1_14

BibTeX

@inproceedings{meng2024eccv-multimodal,
  title     = {{Multi-Modal Crowd Counting via a Broker Modality}},
  author    = {Meng, Haoliang and Hong, Xiaopeng and Wang, Chenhao and Shang, Miao and Zuo, Wangmeng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72904-1_14},
  url       = {https://mlanthology.org/eccv/2024/meng2024eccv-multimodal/}
}