Modality Mixer for Multi-Modal Action Recognition

Abstract

In multi-modal action recognition, it is important to consider not only the complementary nature of different modalities but also global action content. In this paper, we propose a novel network, named Modality Mixer (M-Mixer) network, to leverage complementary information across modalities and the temporal context of an action for multi-modal action recognition. We also introduce a simple yet effective recurrent unit, called Multi-modal Contextualization Unit (MCU), which is a core component of M-Mixer. Our MCU temporally encodes a sequence of one modality (e.g., RGB) with action content features of other modalities (e.g., depth, IR). This process encourages M-Mixer to exploit global action content while supplementing it with complementary information from the other modalities. As a result, our proposed method outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, we demonstrate the effectiveness of M-Mixer through comprehensive ablation studies.
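The abstract describes the MCU only at a high level. As a rough illustration of the idea, the following is a minimal PyTorch-style sketch of a GRU-like recurrent cell whose gates are conditioned on a pooled action-content feature from another modality. All names (ContextualizedRecurrentUnit, context, feat_dim, etc.) are hypothetical; this is a sketch of the described mechanism, not the authors' implementation.

import torch
import torch.nn as nn

class ContextualizedRecurrentUnit(nn.Module):
    """GRU-style cell conditioned on a cross-modal action-content
    vector. Illustrative sketch only, not the authors' code."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Each gate sees the current frame feature, the previous
        # hidden state, and the other modality's global content feature.
        in_dim = feat_dim + hidden_dim + feat_dim
        self.update_gate = nn.Linear(in_dim, hidden_dim)
        self.reset_gate = nn.Linear(in_dim, hidden_dim)
        self.candidate = nn.Linear(in_dim, hidden_dim)

    def forward(self, x_seq: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x_seq:   (B, T, feat_dim) frame features of one modality (e.g., RGB)
        # context: (B, feat_dim)    pooled action content of another modality
        B, T, _ = x_seq.shape
        h = x_seq.new_zeros(B, self.update_gate.out_features)
        for t in range(T):
            inp = torch.cat([x_seq[:, t], h, context], dim=-1)
            z = torch.sigmoid(self.update_gate(inp))   # update gate
            r = torch.sigmoid(self.reset_gate(inp))    # reset gate
            cand = torch.cat([x_seq[:, t], r * h, context], dim=-1)
            h_tilde = torch.tanh(self.candidate(cand)) # candidate state
            h = (1 - z) * h + z * h_tilde
        return h  # temporally encoded, cross-modality-contextualized feature

In this sketch, a caller would pool another modality's features over time (e.g., context = depth_feats.mean(dim=1)) to obtain a global action-content vector, then encode the RGB sequence with it, mirroring the cross-modal conditioning the abstract describes.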

Cite

Text

Lee et al. "Modality Mixer for Multi-Modal Action Recognition." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Lee et al. "Modality Mixer for Multi-Modal Action Recognition." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/lee2023wacv-modality/)

BibTeX

@inproceedings{lee2023wacv-modality,
  title     = {{Modality Mixer for Multi-Modal Action Recognition}},
  author    = {Lee, Sumin and Woo, Sangmin and Park, Yeonju and Nugroho, Muhammad Adi and Kim, Changick},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {3298--3307},
  url       = {https://mlanthology.org/wacv/2023/lee2023wacv-modality/}
}