Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity

Abstract

Video-based Emotional Reaction Intensity (ERI) estimation measures the intensity of subjects’ reactions to stimuli along several emotional dimensions from videos of the subjects as they view the stimuli. We propose a multi-modal architecture for video-based ERI combining video and audio information. Video input is encoded spatially first, frame-by-frame, combining features encoding holistic aspects of the subjects’ facial expressions and features encoding spatially localized aspects of their expressions. Input is then combined across time: from frame-to-frame using gated recurrent units (GRUs), then globally by a transformer. We handle variable video length with a regression token that accumulates information from all frames into a fixed-dimensional vector independent of video length. Audio information is handled similarly: spectral information extracted within each frame is integrated across time by a cascade of GRUs and a transformer with regression token. The video and audio regression tokens’ outputs are merged by concatenation, then input to a final fully connected layer producing intensity estimates. Our architecture achieved excellent performance on the Hume-Reaction dataset in the ERI Estimation Challenge of the Fifth Competition on Affective Behavior Analysis in-the-Wild (ABAW5). The Pearson Correlation Coefficients between estimated and subject self-reported scores, averaged across all emotions, were 0.455 on the validation dataset and 0.4547 on the test dataset, well above the baselines. The transformer’s self-attention mechanism enables our architecture to focus on the most critical video frames regardless of length. Ablation experiments establish the advantages of combining holistic/local features and of multi-modal integration. Code available at https://github.com/HKUST-NISL/ABAW5.
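The regression-token idea described above can be illustrated with a minimal NumPy sketch: a learned token is prepended to the per-frame feature sequence before self-attention, and the token's output row serves as a fixed-size summary regardless of how many frames the video has. All names, dimensions, and weights here are illustrative assumptions, not the paper's implementation (see the linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over the rows of X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def pool_with_regression_token(frames, reg_token, Wq, Wk, Wv):
    # Prepend the regression token, attend over all frames, and read out
    # the token's row: a fixed-size summary independent of video length.
    X = np.vstack([reg_token, frames])
    return self_attention(X, Wq, Wk, Wv)[0]

rng = np.random.default_rng(0)
d = 8  # illustrative feature dimension
reg_token = rng.standard_normal((1, d))  # learned jointly with the model in practice
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

# Videos of different lengths map to summaries of the same shape.
for T in (5, 12):
    summary = pool_with_regression_token(rng.standard_normal((T, d)), reg_token, Wq, Wk, Wv)
    assert summary.shape == (d,)
```

Because the token attends to every frame, its output weights frames by learned relevance rather than averaging them uniformly, which is what lets the model emphasize the most informative moments of a reaction.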

Cite

Text

Fang et al. "Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. doi:10.1109/CVPRW59228.2023.00631

Markdown

[Fang et al. "Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.](https://mlanthology.org/cvprw/2023/fang2023cvprw-integrating/) doi:10.1109/CVPRW59228.2023.00631

BibTeX

@inproceedings{fang2023cvprw-integrating,
  title     = {{Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity}},
  author    = {Fang, Yini and Wu, Liang and Jumelle, Frederic and Shi, Bertram E.},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2023},
  pages     = {5934--5939},
  doi       = {10.1109/CVPRW59228.2023.00631},
  url       = {https://mlanthology.org/cvprw/2023/fang2023cvprw-integrating/}
}