Joint Multimodal Transformer for Emotion Recognition in the Wild

Abstract

Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships among modalities such as visual, textual, physiological, and auditory. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework exploits the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Our JMT fusion architecture then integrates the individual modality embeddings, allowing the model to capture both inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks, (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors), indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems using the proposed fusion outperform relevant baseline and state-of-the-art methods. Code is available at: https://github.com/PoloWlg/Joint-Multimodal-Transformer-6th-ABAW
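To make the fusion idea in the abstract concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between two modality embeddings. It is an illustration only and not the authors' JMT: the key-based variant, learned projections, and backbones are omitted, and the function names (`cross_attention`, `fuse`), the concatenation step, and the toy numbers are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output feature at a time.
        attended.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return attended


def fuse(visual, audio):
    """Hypothetical fusion: each modality queries the other, and the two
    attended sequences are concatenated per time step (an assumption)."""
    visual_from_audio = cross_attention(visual, audio, audio)
    audio_from_visual = cross_attention(audio, visual, visual)
    return [v + a for v, a in zip(visual_from_audio, audio_from_visual)]


# Toy embeddings: 3 time steps, 2 features per modality (made-up numbers).
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(visual, audio)
```

Here `fused` holds one 4-dimensional vector per time step: two features gathered from the audio sequence under visual queries, followed by two gathered from the visual sequence under audio queries. A real system would typically apply learned query, key, and value projections and feed the fused sequence to a prediction head.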

Cite

Text

Waligora et al. "Joint Multimodal Transformer for Emotion Recognition in the Wild." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00465

Markdown

[Waligora et al. "Joint Multimodal Transformer for Emotion Recognition in the Wild." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/waligora2024cvprw-joint/) doi:10.1109/CVPRW63382.2024.00465

BibTeX

@inproceedings{waligora2024cvprw-joint,
  title     = {{Joint Multimodal Transformer for Emotion Recognition in the Wild}},
  author    = {Waligora, Paul and Aslam, Muhammad Haseeb and Zeeshan, Muhammad Osama and Belharbi, Soufiane and Koerich, Alessandro Lameiras and Pedersoli, Marco and Bacon, Simon and Granger, Eric},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {4625-4635},
  doi       = {10.1109/CVPRW63382.2024.00465},
  url       = {https://mlanthology.org/cvprw/2024/waligora2024cvprw-joint/}
}