Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding

Minh Tran, Yelin Kim, Che-Chun Su, Min Sun, Cheng-Hao Kuo, Mohammad Soleymani

ECCV 2024

doi:10.1007/978-3-031-72989-8_1

Abstract

Self-supervised learning methods have demonstrated impressive performance across visual understanding tasks, including human behavior understanding. However, there has been limited work on self-supervised learning for egocentric social videos. Visual processing in such contexts faces several challenges, including noisy input, limited availability of egocentric social data, and the absence of pretrained models tailored to egocentric contexts. We propose Ex2Eg-MAE, a novel framework that leverages novel-view face synthesis for dynamic perspective data augmentation from abundant exocentric videos and enhances the self-supervised learning process for VideoMAE via: 1) reconstructing exocentric videos from masked dynamic perspective videos; and 2) predicting feature representations of a teacher model based on the corresponding exocentric frames. Experimental results demonstrate that Ex2Eg-MAE consistently excels across diverse social role understanding tasks. It achieves state-of-the-art results in Ego4D's Talk-to-me challenge (+0.7% mAP, +3.2% Accuracy). For the Look-at-me challenge, it achieves competitive performance with the state-of-the-art (-0.7% mAP, +1.5% Accuracy) without supervised training on external data. On the EasyCom dataset, our method surpasses both supervised Active Speaker Detection approaches and state-of-the-art video encoders (+1.2% mAP, +1.9% Accuracy compared to MARLIN).
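
The two training signals named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "encoder", "decoder", and "teacher" maps, the zeroing of masked tokens (a real masked autoencoder drops them instead), and the `distill_weight` parameter are all hypothetical simplifications chosen to keep the example self-contained. It only shows how a reconstruction term on masked positions and a teacher feature-matching term might be combined into one loss.

```python
import random

def random_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans; True marks a masked (hidden) token."""
    num_masked = round(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]

def linear(vec, weights):
    """Apply a weight matrix (list of output rows) to a vector."""
    return [sum(x * w for x, w in zip(vec, row)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length lists of vectors."""
    total, count = 0.0, 0
    for u, v in zip(a, b):
        for x, y in zip(u, v):
            total += (x - y) ** 2
            count += 1
    return total / count if count else 0.0

def combined_loss(aug_tokens, exo_tokens, mask,
                  w_enc, w_dec, w_teacher, distill_weight=1.0):
    """Toy two-term objective (all components are illustrative assumptions).

    aug_tokens: tokens from the perspective-augmented video (student input)
    exo_tokens: tokens from the original exocentric video (targets)
    """
    # Hide masked tokens from the student (simplification: zero them out).
    visible = [[0.0] * len(t) if m else t for t, m in zip(aug_tokens, mask)]
    student_feats = [linear(t, w_enc) for t in visible]
    recon = [linear(f, w_dec) for f in student_feats]

    # 1) Reconstruct the exocentric tokens at the masked positions.
    recon_loss = mse([r for r, m in zip(recon, mask) if m],
                     [e for e, m in zip(exo_tokens, mask) if m])

    # 2) Match features that a fixed teacher computes from exocentric tokens.
    teacher_feats = [linear(t, w_teacher) for t in exo_tokens]
    distill_loss = mse(student_feats, teacher_feats)

    total = recon_loss + distill_weight * distill_loss
    return total, recon_loss, distill_loss
```

In this sketch the student only ever sees the augmented view, while both targets come from the exocentric source, which mirrors the direction of adaptation described above. The relative weighting of the two terms is left as a free parameter here because the abstract does not specify it.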


Cite

Text

Tran et al. "Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72989-8_1

Markdown

[Tran et al. "Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/tran2024eccv-ex2egmae/) doi:10.1007/978-3-031-72989-8_1

BibTeX

@inproceedings{tran2024eccv-ex2egmae,
  title     = {{Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding}},
  author    = {Tran, Minh and Kim, Yelin and Su, Che-Chun and Sun, Min and Kuo, Cheng-Hao and Soleymani, Mohammad},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72989-8_1},
  url       = {https://mlanthology.org/eccv/2024/tran2024eccv-ex2egmae/}
}