Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

Abstract

Current multimodal methods assume that a video dataset is multimodally annotated, i.e., that both the auditory and visual modalities are labeled or class-relevant, and accordingly apply modality fusion or cross-modality attention. Effectively leveraging the audio modality in vision-specific annotated videos for action recognition, however, is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-specific annotated dataset. We adopt language models (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels; SAVLD thereby serves as a bridge between audio and video datasets. SAVLD, together with a pretrained multi-label audio model, is then used to estimate audio-visual modality relevance during training. Accordingly, we propose a novel learnable irrelevant modality dropout (IMD) that completely drops out the irrelevant audio modality and fuses only the relevant modalities. Moreover, we present a new two-stream video Transformer for efficiently modeling the visual modalities. Results on several vision-specific annotated datasets, including Kinetics-400 and UCF-101, validate our framework, which outperforms most relevant action recognition methods.
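The SAVLD idea can be illustrated with a short sketch: embed every video label and audio label with a BERT-style encoder, then map each video label to its K most similar audio labels by cosine similarity. This is a minimal reading of the abstract, not the authors' released code; the pooling scheme, model checkpoint, and function names below are assumptions.

```python
# Hypothetical sketch of building a semantic audio-video label dictionary (SAVLD).
# Assumes: mean-pooled BERT embeddings and cosine similarity; the paper may use
# a different encoder or similarity measure.
import torch
from transformers import AutoTokenizer, AutoModel

def embed_labels(labels, tokenizer, model):
    """Mean-pooled BERT embeddings for a list of label strings."""
    batch = tokenizer(labels, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (N, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)         # (N, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (N, D)

def build_savld(video_labels, audio_labels, k=5, name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    v = torch.nn.functional.normalize(embed_labels(video_labels, tokenizer, model), dim=-1)
    a = torch.nn.functional.normalize(embed_labels(audio_labels, tokenizer, model), dim=-1)
    sim = v @ a.T                                        # cosine similarities (V, A)
    topk = sim.topk(k, dim=-1)
    return {video_labels[i]: [audio_labels[j] for j in topk.indices[i].tolist()]
            for i in range(len(video_labels))}

# Example: each video label is mapped to its 2 most semantically relevant audio labels.
savld = build_savld(["playing guitar", "mowing the lawn"],
                    ["acoustic guitar", "lawn mower", "speech", "dog barking"], k=2)
```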
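Similarly, the irrelevant modality dropout can be sketched as a small gate that scores the relevance of the audio features and zeroes them out before fusion when they are deemed irrelevant. The module structure, the straight-through hard gate, and the suggested supervision are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a learnable irrelevant modality dropout (IMD), assuming a
# sigmoid gate with a straight-through hard decision. Dropped audio is exactly zero,
# so fusion sees only the relevant modalities.
import torch
import torch.nn as nn

class IrrelevantModalityDropout(nn.Module):
    def __init__(self, audio_dim, video_dim, hidden=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(audio_dim + video_dim, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, audio_feat, video_feat):
        logit = self.gate(torch.cat([audio_feat, video_feat], dim=-1))
        p = torch.sigmoid(logit)            # soft relevance score in (0, 1)
        hard = (p > 0.5).float()            # hard keep/drop decision
        keep = hard + p - p.detach()        # straight-through: hard forward, soft gradient
        return audio_feat * keep, logit

# During training, the gate logit could plausibly be supervised with a BCE loss
# against the SAVLD-based relevance label: 1 if the pretrained audio model's
# predicted tags overlap the video label's K relevant audio labels, else 0.
```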

Cite

Text

Alfasly et al. "Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01957

Markdown

[Alfasly et al. "Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/alfasly2022cvpr-learnable/) doi:10.1109/CVPR52688.2022.01957

BibTeX

@inproceedings{alfasly2022cvpr-learnable,
  title     = {{Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos}},
  author    = {Alfasly, Saghir and Lu, Jian and Xu, Chen and Zou, Yuru},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {20208--20217},
  doi       = {10.1109/CVPR52688.2022.01957},
  url       = {https://mlanthology.org/cvpr/2022/alfasly2022cvpr-learnable/}
}