A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
Abstract
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary modalities, such as audio, visual, and biosignals. However, most state-of-the-art audio-visual (A-V) fusion methods rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. This paper focuses on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention fusion model that can effectively exploit the complementary inter-modal relationships, allowing for accurate prediction of valence and arousal. In particular, the model computes cross-attention weights based on the correlation between a joint feature representation and the individual modality representations. By deploying a joint A-V feature representation in the cross-attention module, the performance of our fusion model improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It achieves a concordance correlation coefficient (CCC) of 0.374 (0.663) for valence and 0.363 (0.584) for arousal on the test set (validation set). This is a significant improvement over the baseline of the third Affective Behavior Analysis in-the-Wild (ABAW3) 2022 challenge, which obtained a CCC of 0.180 (0.310) for valence and 0.170 (0.170) for arousal.
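To make the fusion mechanism concrete, below is a minimal PyTorch sketch of the joint cross-attention idea described in the abstract: audio and visual features are concatenated into a joint representation, and the correlation between each modality and the joint features yields attention weights that re-weight that modality before fusion. All names (`JointCrossAttention`, `W_a`, `out_a`, `head`), tensor shapes, and layer choices are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class JointCrossAttention(nn.Module):
    """Hypothetical sketch of joint cross-attentional A-V fusion.

    Audio and visual features are concatenated into a joint
    representation; the cross-correlation between each modality and
    the joint features produces attention weights that re-weight that
    modality before late fusion. Sizes and names are assumptions.
    """

    def __init__(self, d: int):
        super().__init__()
        self.d = d
        # Learnable correlation weights between each modality (dim d)
        # and the joint representation (dim 2d).
        self.W_a = nn.Parameter(torch.empty(d, 2 * d))
        self.W_v = nn.Parameter(torch.empty(d, 2 * d))
        nn.init.xavier_uniform_(self.W_a)
        nn.init.xavier_uniform_(self.W_v)
        # Project attended joint features back to d per modality.
        self.out_a = nn.Linear(2 * d, d)
        self.out_v = nn.Linear(2 * d, d)
        # Regression head for the two dimensions: valence and arousal.
        self.head = nn.Linear(2 * d, 2)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a, x_v: (batch, seq_len, d) clip-level audio / visual features.
        j = torch.cat([x_a, x_v], dim=-1)  # (B, L, 2d) joint representation

        # Cross-correlation between each modality and the joint features.
        c_a = torch.tanh(x_a @ self.W_a @ j.transpose(1, 2) / self.d ** 0.5)  # (B, L, L)
        c_v = torch.tanh(x_v @ self.W_v @ j.transpose(1, 2) / self.d ** 0.5)

        # Attention over the joint representation, driven by each modality.
        att_a = torch.softmax(c_a, dim=-1) @ j  # (B, L, 2d)
        att_v = torch.softmax(c_v, dim=-1) @ j

        # Residual re-weighting of each modality, then late fusion.
        f_a = x_a + self.out_a(att_a)  # (B, L, d)
        f_v = x_v + self.out_v(att_v)
        return self.head(torch.cat([f_a, f_v], dim=-1))  # (B, L, 2)
```

For example, `JointCrossAttention(d=128)` applied to two `(4, 16, 128)` tensors returns a `(4, 16, 2)` tensor of per-clip valence and arousal predictions.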
Cite
Text
Praveen et al. "A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00278
Markdown
[Praveen et al. "A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/praveen2022cvprw-joint/) doi:10.1109/CVPRW56347.2022.00278
BibTeX
@inproceedings{praveen2022cvprw-joint,
title = {{A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition}},
author = {Praveen, R. Gnana and de Melo, Wheidima Carneiro and Ullah, Nasib and Aslam, Haseeb and Zeeshan, Osama and Denorme, Théo and Pedersoli, Marco and Koerich, Alessandro L. and Bacon, Simon and Cardinal, Patrick and Granger, Eric},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2022},
pages = {2485--2494},
doi = {10.1109/CVPRW56347.2022.00278},
url = {https://mlanthology.org/cvprw/2022/praveen2022cvprw-joint/}
}