MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network

Abstract

Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and the temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal independently, often overlooking the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. Evaluated on the Aff-Wild2 dataset, MAVEN achieved a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline (CCC 0.22). The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations.
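The two quantitative ideas in the abstract, the CCC evaluation metric and the polar-coordinate view of valence-arousal, can be sketched briefly. This is a minimal illustration, not the authors' implementation; the function names and the use of NumPy are assumptions for the example.

```python
import numpy as np

def ccc(y_true, y_pred):
    # Concordance correlation coefficient (Lin, 1989), the metric
    # reported in the abstract:
    #   2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    x = np.asarray(y_true, dtype=float)
    y = np.asarray(y_pred, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def to_polar(valence, arousal):
    # Map a Cartesian (valence, arousal) point to polar coordinates
    # (intensity r, angle theta), the representation Russell's
    # circumplex model suggests for emotion.
    r = np.hypot(valence, arousal)
    theta = np.arctan2(arousal, valence)
    return r, theta
```

A perfect predictor gives `ccc == 1.0`; systematic bias or scale mismatch between predictions and labels pulls the score toward 0, which is why CCC is preferred over plain Pearson correlation for valence-arousal regression.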

Cite

Text

Ahire et al. "MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Ahire et al. "MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/ahire2025cvprw-maven/)

BibTeX

@inproceedings{ahire2025cvprw-maven,
  title     = {{MAVEN: Multi-Modal Attention for Valence-Arousal Emotion Network}},
  author    = {Ahire, Vrushank and Shah, Kunal and Khan, Mudasir Nazir and Pakhale, Nikhil and Sookha, Lownish Rai and Ganaie, Mudasir Ahmad and Dhall, Abhinav},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {5789--5799},
  url       = {https://mlanthology.org/cvprw/2025/ahire2025cvprw-maven/}
}