Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5

Abstract

We use two multimodal models for continuous valence-arousal recognition from visual, audio, and linguistic information. The first model is the same as the one we used in ABAW2 and ABAW3 and employs leader-follower attention. The second model shares the same architecture for spatial and temporal encoding, but its fusion block uses a compact and straightforward channel attention borrowed from the End2You toolkit. Unlike our previous attempts, which used VGGish features directly as the audio input, we now feed log-mel spectrograms to a pre-trained VGG model and fine-tune it during training. To make full use of the data and alleviate over-fitting, cross-validation is carried out. The code is available at https://github.com/sucv/ABAW3.
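
As a rough illustration of the channel-attention fusion described above, the following is a minimal PyTorch sketch of a squeeze-and-excitation style gate applied to concatenated visual, audio, and linguistic features. The class name, feature dimensions, and reduction ratio are illustrative assumptions rather than the authors' or End2You's exact implementation; see the linked repository for the actual code.

# Minimal sketch (assumption): a squeeze-and-excitation style channel attention
# used to fuse concatenated visual, audio, and linguistic features.
# Names and dimensions are illustrative, not taken from the authors' repository.
import torch
import torch.nn as nn


class ChannelAttentionFusion(nn.Module):
    """Re-weight concatenated modality features with a channel-attention gate."""

    def __init__(self, feature_dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feature_dim, feature_dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(feature_dim // reduction, feature_dim),
            nn.Sigmoid(),
        )

    def forward(self, visual, audio, text):
        # Each input: (batch, time, dim_modality); concatenate along channels.
        fused = torch.cat([visual, audio, text], dim=-1)
        # Squeeze over time, excite per channel, then re-weight the features.
        weights = self.gate(fused.mean(dim=1)).unsqueeze(1)
        return fused * weights


if __name__ == "__main__":
    # Toy check with hypothetical feature sizes (512 visual, 128 audio, 64 text).
    v = torch.randn(2, 300, 512)
    a = torch.randn(2, 300, 128)
    t = torch.randn(2, 300, 64)
    fusion = ChannelAttentionFusion(feature_dim=512 + 128 + 64)
    print(fusion(v, a, t).shape)  # torch.Size([2, 300, 704])

The gated output would then feed the temporal encoder and regression head for valence-arousal prediction; the surrounding training loop and the fine-tuned VGG audio branch are omitted here.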

Cite

Text

Zhang et al. "Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. doi:10.1109/CVPRW59228.2023.00611

Markdown

[Zhang et al. "Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.](https://mlanthology.org/cvprw/2023/zhang2023cvprw-multimodal-a/) doi:10.1109/CVPRW59228.2023.00611

BibTeX

@inproceedings{zhang2023cvprw-multimodal-a,
  title     = {{Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5}},
  author    = {Zhang, Su and Zhao, Ziyuan and Guan, Cuntai},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2023},
  pages     = {5764--5769},
  doi       = {10.1109/CVPRW59228.2023.00611},
  url       = {https://mlanthology.org/cvprw/2023/zhang2023cvprw-multimodal-a/}
}