Language-Guided Multi-Modal Emotional Mimicry Intensity Estimation
Abstract
Emotional Mimicry Intensity (EMI) estimation aims to identify the intensity of mimicry exhibited by individuals in response to observed emotions. The challenge in EMI estimation lies in discerning nuanced facial expression cues of mimicry behaviors from the seed video and the text instructions. In this paper, we propose a multi-modal EMI estimation framework that leverages visual, auditory, and textual modalities to capture a comprehensive emotional profile. We first extract representations for each modality separately and then fuse the modality-specific representations via a Temporal Segment Network, optimizing for temporal coherence and emotional context. Furthermore, we find that participants demonstrate notable proficiency in mimicking text instructions, yet are less effective at replicating facial expressions and vocal tones. In light of this, we design a contrastive learning mechanism to refine the extracted features based on textual guidance. By doing so, features derived from similar text instructions are closely aligned, enhancing the estimation of emotional mimicry intensity by leveraging the dominant textual modality. Experiments conducted on the Hume-Vidmimic2 dataset demonstrate the effectiveness of our framework in EMI estimation. Our framework is recognized as the leading solution in the Emotional Mimicry Intensity (EMI) Estimation Challenge at the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). More information about the Competition can be found on the 6th ABAW website.
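The text-guided contrastive mechanism described above can be illustrated with a minimal sketch: an InfoNCE-style loss in which samples sharing the same text instruction are treated as positives and pulled together in feature space, while all other samples act as negatives. This is an assumption-laden illustration (the function name, the NumPy implementation, and the use of integer instruction IDs are ours), not the authors' exact formulation.

```python
import numpy as np

def text_guided_contrastive_loss(features, text_ids, temperature=0.1):
    """Illustrative InfoNCE-style loss: features of samples that share
    the same text instruction are positives; all others are negatives.
    This is a sketch of the idea, not the paper's exact objective."""
    # L2-normalize so dot products become cosine similarities
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature  # pairwise scaled similarities
    n = len(text_ids)
    total, count = 0.0, 0
    for i in range(n):
        # positives: other samples with the same text instruction
        pos = [j for j in range(n) if j != i and text_ids[j] == text_ids[i]]
        if not pos:
            continue
        # softmax over all other samples (exclude self-similarity)
        logits = np.delete(sim[i], i)
        labels = np.delete(np.arange(n), i)
        log_prob = logits - np.log(np.exp(logits).sum())
        # average negative log-probability over the positive pairs
        total += -np.mean([log_prob[labels.tolist().index(j)] for j in pos])
        count += 1
    return total / max(count, 1)
```

Minimizing this loss aligns features derived from similar text instructions, which is the effect the abstract attributes to the textual-guidance mechanism.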
Cite
Text
Qiu et al. "Language-Guided Multi-Modal Emotional Mimicry Intensity Estimation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00477
Markdown
[Qiu et al. "Language-Guided Multi-Modal Emotional Mimicry Intensity Estimation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/qiu2024cvprw-languageguided/) doi:10.1109/CVPRW63382.2024.00477
BibTeX
@inproceedings{qiu2024cvprw-languageguided,
title = {{Language-Guided Multi-Modal Emotional Mimicry Intensity Estimation}},
author = {Qiu, Feng and Zhang, Wei and Liu, Chen and Li, Lincheng and Du, Heming and Guo, Tianchen and Yu, Xin},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {4742-4751},
doi = {10.1109/CVPRW63382.2024.00477},
url = {https://mlanthology.org/cvprw/2024/qiu2024cvprw-languageguided/}
}