AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

Abstract

Leveraging the synergy of audio and visual data is essential for understanding human emotions and behaviors, especially in in-the-wild settings. Traditional methods for integrating such multimodal information often stumble, leading to less-than-ideal outcomes in the task of facial action unit (AU) detection. Addressing these challenges, our study introduces a novel approach that synergistically enhances audio-visual data processing. For audio, we employ Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features, enriched through a pre-trained VGGish network, significantly bolstering the audio feature landscape. Concurrently, in the visual spectrum, we enhance feature extraction using an iResNet model pre-trained on facial datasets, thereby improving the robustness and quality of the visual data representation. With this augmented feature set, Temporal Convolutional Networks (TCNs) are applied to meticulously extract and analyze time-series characteristics within each modality, fostering a nuanced understanding of temporal dynamics. The integration of cross-modal information is then achieved through a fine-tuned pre-trained GPT-2 model, facilitating sophisticated and context-aware fusion of the multimodal data. This comprehensive approach not only enhances the accuracy of AU detection but also paves the way for a nuanced comprehension of complex emotional and behavioral expressions in real-world scenarios.
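The temporal modeling step above rests on the TCN's defining operation: a causal, dilated 1-D convolution, in which each output frame depends only on current and past frames. The following minimal sketch (a single-channel toy, not the paper's implementation; kernel, dilation, and padding choices are illustrative assumptions) shows how the causal left-padding keeps the receptive field from looking into the future:

```python
def causal_dilated_conv1d(x, w, dilation=1):
    """Causal dilated 1-D convolution, the core TCN building block.

    x : list of floats, one value per time step (a toy 1-channel feature)
    w : list of K kernel taps; w[0] weights the current frame,
        w[k] weights the frame k*dilation steps in the past
    Output y has the same length as x, and y[t] uses only x[<= t].
    """
    K = len(w)
    # Left-pad with zeros so early outputs have a full (causal) window.
    pad = dilation * (K - 1)
    xp = [0.0] * pad + list(x)
    return [
        sum(w[k] * xp[t + pad - dilation * k] for k in range(K))
        for t in range(len(x))
    ]
```

Stacking such layers with exponentially growing dilations (1, 2, 4, ...) is what lets a TCN cover long per-modality feature sequences with few layers; an identity kernel `w = [1.0, 0.0]` simply reproduces the input, while `w = [0.0, 1.0]` delays it by one frame.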

Cite

Text

Yu et al. "AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00484

Markdown

[Yu et al. "AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/yu2024cvprw-audtgn/) doi:10.1109/CVPRW63382.2024.00484

BibTeX

@inproceedings{yu2024cvprw-audtgn,
  title     = {{AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts}},
  author    = {Yu, Jun and Zhang, Zerui and Wei, Zhihong and Zhao, Gongpeng and Cai, Zhongpeng and Wang, Yongqi and Xie, Guochen and Zhu, Jichao and Zhu, Wangyuan and Liu, Qingsong and Liang, Jiaen},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {4814--4821},
  doi       = {10.1109/CVPRW63382.2024.00484},
  url       = {https://mlanthology.org/cvprw/2024/yu2024cvprw-audtgn/}
}