Video Representation Learning for Conversational Facial Expression Recognition Guided by Multiple View Reconstruction

Abstract

Conversational facial expression recognition entails challenges such as handling facial dynamics, small available datasets, low-intensity and fine-grained emotional expressions, and extreme face angles. Towards addressing these challenges, we propose Masking Action Units and Reconstructing multiple Angles (MAURA) pre-training. MAURA is an efficient self-supervised method that permits the use of small datasets, while preserving end-to-end conversational facial expression recognition with a Vision Transformer. MAURA masks videos at locations with active Action Units and reconstructs synchronized multi-view videos, thus learning the dependencies between muscle movements and encoding information that might only be visible in a few frames and/or in certain views. Based on one view (e.g., frontal), the encoder reconstructs other views (e.g., top, down, laterals). This masking-and-reconstruction strategy provides a powerful representation, beneficial for downstream facial expression tasks. Our experimental analysis shows that we consistently outperform the state-of-the-art in the challenging settings of low-intensity and fine-grained conversational facial expression recognition on four datasets, including in-the-wild DFEW, CMU-MOSEI, MFA, and multi-view MEAD. Our results suggest that MAURA is able to learn robust and generic video representations.
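The AU-guided masking described above can be illustrated with a minimal sketch. The abstract does not specify the exact sampling procedure, so the split between AU-guided and random masking below (`au_fraction`), the function name, and the per-patch score format are all illustrative assumptions, not the paper's implementation:

```python
import random

def au_guided_mask(au_scores, mask_ratio=0.75, au_fraction=0.5, seed=0):
    """Select video-patch indices to mask before reconstruction.

    au_scores   : per-patch Action Unit activity (higher = more facial motion);
                  the score format is an assumption for this sketch
    mask_ratio  : overall fraction of patches to mask
    au_fraction : share of the masked budget taken from the most AU-active
                  patches (hypothetical parameter); the rest is random
    """
    rng = random.Random(seed)
    n = len(au_scores)
    budget = int(round(mask_ratio * n))
    # Rank patches by AU activity, most active first.
    ranked = sorted(range(n), key=lambda i: au_scores[i], reverse=True)
    k = int(round(au_fraction * budget))
    masked = set(ranked[:k])                          # AU-guided portion
    remaining = [i for i in range(n) if i not in masked]
    masked.update(rng.sample(remaining, budget - k))  # random portion
    return sorted(masked)

# Example: 8 patches, mask half, half of the budget AU-guided.
scores = [0.9, 0.1, 0.8, 0.2, 0.0, 0.3, 0.05, 0.6]
masked = au_guided_mask(scores, mask_ratio=0.5, au_fraction=0.5)
```

Under this sketch, the two most AU-active patches (indices 0 and 2) are always masked, forcing the encoder to reconstruct them, which is the intuition behind targeting regions of active muscle movement.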

Cite

Text

Strizhkova et al. "Video Representation Learning for Conversational Facial Expression Recognition Guided by Multiple View Reconstruction." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00472

Markdown

[Strizhkova et al. "Video Representation Learning for Conversational Facial Expression Recognition Guided by Multiple View Reconstruction." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/strizhkova2024cvprw-video/) doi:10.1109/CVPRW63382.2024.00472

BibTeX

@inproceedings{strizhkova2024cvprw-video,
  title     = {{Video Representation Learning for Conversational Facial Expression Recognition Guided by Multiple View Reconstruction}},
  author    = {Strizhkova, Valeriya and Ferrari, Laura M. and Kachmar, Hadi and Dantcheva, Antitza and Brémond, François},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {4693--4702},
  doi       = {10.1109/CVPRW63382.2024.00472},
  url       = {https://mlanthology.org/cvprw/2024/strizhkova2024cvprw-video/}
}