Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection

Abstract

Pedestrian detection is an important challenge in computer vision due to its wide range of applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploiting features exclusive to each modality remains limited. As a result, the features specific to one modality cannot be fully utilized, and the fusion results can easily be dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. Experiments show that our method achieves state-of-the-art detection performance on public datasets, and ablation studies confirm the effectiveness of the proposed components.
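The core idea of cross-modality attention, letting tokens from one modality query tokens from the other so each stream absorbs complementary information, can be illustrated with a minimal single-head sketch. This is not the paper's exact CAT/MFT architecture; the token counts, dimensions, and the residual-plus-concatenate fusion below are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_attention(queries, keys_values):
    """One direction of cross-attention: `queries` (e.g. RGB tokens)
    attend over `keys_values` (e.g. thermal tokens)."""
    d_k = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (N_q, N_kv)
    attn = softmax(scores, axis=-1)                   # rows sum to 1
    return attn @ keys_values                          # kv info aligned to queries

# Toy feature tokens: 16 tokens per modality, 64-d each (illustrative sizes).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 64))
thermal = rng.standard_normal((16, 64))

# Symmetric cross-attention with residual connections, then channel-wise
# concatenation as a simple stand-in for the fusion step.
fused = np.concatenate(
    [rgb + cross_modality_attention(rgb, thermal),
     thermal + cross_modality_attention(thermal, rgb)],
    axis=-1,
)
print(fused.shape)  # (16, 128)
```

In a full transformer block, the queries, keys, and values would each pass through learned linear projections and multiple heads; the sketch omits these to isolate the cross-modal attention pattern itself.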

Cite

Text

Lee et al. "Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25072-9_41

Markdown

[Lee et al. "Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/lee2022eccvw-crossmodality/) doi:10.1007/978-3-031-25072-9_41

BibTeX

@inproceedings{lee2022eccvw-crossmodality,
  title     = {{Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection}},
  author    = {Lee, Wei-Yu and Jovanov, Ljubomir and Philips, Wilfried},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2022},
  pages     = {608--623},
  doi       = {10.1007/978-3-031-25072-9_41},
  url       = {https://mlanthology.org/eccvw/2022/lee2022eccvw-crossmodality/}
}