MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network
Abstract
Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and the temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal independently, overlooking the inherent correlation between the two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. On the Aff-Wild2 dataset, MAVEN achieves a concordance correlation coefficient (CCC) of 0.3061, surpassing the 0.22 CCC of the ResNet-50 baseline. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world settings.
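The abstract references two concrete pieces of machinery: the CCC metric used to score predictions on Aff-Wild2, and the polar-coordinate parameterization of valence and arousal under Russell's circumplex model. The following is a minimal NumPy sketch of both, not the authors' implementation; the function names, the synthetic data, and the use of population statistics in the CCC are our own illustrative assumptions.

# Minimal sketch (not the paper's code) of the CCC metric and the
# Cartesian <-> polar mapping of valence-arousal on Russell's circumplex.
import numpy as np

def ccc(preds: np.ndarray, labels: np.ndarray) -> float:
    """Concordance correlation coefficient:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    mu_p, mu_l = preds.mean(), labels.mean()
    cov = ((preds - mu_p) * (labels - mu_l)).mean()
    return float(2 * cov / (preds.var() + labels.var() + (mu_p - mu_l) ** 2))

def to_polar(valence: np.ndarray, arousal: np.ndarray):
    """Cartesian valence-arousal -> (intensity, angle) on the circumplex."""
    return np.hypot(valence, arousal), np.arctan2(arousal, valence)

def to_cartesian(intensity: np.ndarray, angle: np.ndarray):
    """Inverse mapping back to (valence, arousal)."""
    return intensity * np.cos(angle), intensity * np.sin(angle)

# Example: per-frame annotations vs. noisy predictions, both in [-1, 1].
rng = np.random.default_rng(0)
labels = rng.uniform(-1, 1, 100)
preds = np.clip(labels + rng.normal(0, 0.3, 100), -1, 1)
print(f"valence CCC: {ccc(preds, labels):.4f}")

Unlike Pearson correlation, the CCC penalizes shifts in mean and scale as well as decorrelation, which is why it is the standard metric for continuous valence-arousal estimation on Aff-Wild2.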
Cite
Text
Ahire et al. "MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.
Markdown
[Ahire et al. "MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/ahire2025cvprw-maven/)
BibTeX
@inproceedings{ahire2025cvprw-maven,
title = {{MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network}},
author = {Ahire, Vrushank and Shah, Kunal and Khan, Mudasir Nazir and Pakhale, Nikhil and Sookha, Lownish Rai and Ganaie, Mudasir Ahmad and Dhall, Abhinav},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2025},
pages = {5789--5799},
url = {https://mlanthology.org/cvprw/2025/ahire2025cvprw-maven/}
}