Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding
Abstract
Self-supervised learning methods have demonstrated impressive performance across visual understanding tasks, including human behavior understanding. However, there has been limited work on self-supervised learning for egocentric social videos. Visual processing in such contexts faces several challenges, including noisy input, limited availability of egocentric social data, and the absence of pretrained models tailored to egocentric contexts. We propose Ex2Eg-MAE, a novel framework that leverages novel-view face synthesis for dynamic perspective data augmentation from abundant exocentric videos and enhances the self-supervised learning process for VideoMAE via: 1) reconstructing exocentric videos from masked dynamic perspective videos; and 2) predicting feature representations of a teacher model based on the corresponding exocentric frames. Experimental results demonstrate that Ex2Eg-MAE consistently excels across diverse social role understanding tasks. It achieves state-of-the-art results in Ego4D’s Talk-to-me challenge (+0.7% mAP, +3.2% Accuracy). For the Look-at-me challenge, it achieves competitive performance with the state-of-the-art (-0.7% mAP, +1.5% Accuracy) without supervised training on external data. On the EasyCom dataset, our method surpasses both supervised Active Speaker Detection approaches and state-of-the-art video encoders (+1.2% mAP, +1.9% Accuracy compared to MARLIN).
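The two training signals named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the loss form, so the use of mean squared error for both terms, the `distill_weight` parameter, the 0.75 mask ratio, and every function name below are assumptions made for illustration only.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    assert len(a) == len(b) and len(a) > 0
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sample_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def combined_loss(pred_patches, exo_patches, pred_feats, teacher_feats,
                  masked_idx, distill_weight=1.0):
    """Hypothetical two-term objective (assumed form, not from the paper).

    recon:   error between predicted patches and the exocentric target
             patches, averaged over masked positions only.
    distill: error between the student's features and teacher features
             computed on the corresponding exocentric frames.
    """
    recon = sum(mse(pred_patches[i], exo_patches[i]) for i in masked_idx)
    recon /= len(masked_idx)
    distill = mse(pred_feats, teacher_feats)
    return recon + distill_weight * distill, recon, distill


# Toy data: 8 patches of 2 values each; every prediction is off by 0.5.
rng = random.Random(0)
masked = sample_mask(num_patches=8, mask_ratio=0.75, rng=rng)
exo = [[float(i), float(i) + 1.0] for i in range(8)]
pred = [[v + 0.5 for v in patch] for patch in exo]
total, recon, distill = combined_loss(pred, exo, [1.0, 2.0], [1.0, 4.0], masked)
print(len(masked), recon, distill, total)
```

In a real pipeline the encoder input would be the masked dynamic perspective (synthesized novel-view) clip and the targets would come from the original exocentric clip; here both are replaced by hand-written lists so the arithmetic is easy to follow.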
Cite
Text
Tran et al. "Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72989-8_1
Markdown
[Tran et al. "Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/tran2024eccv-ex2egmae/) doi:10.1007/978-3-031-72989-8_1
BibTeX
@inproceedings{tran2024eccv-ex2egmae,
title = {{Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding}},
author = {Tran, Minh and Kim, Yelin and Su, Che-Chun and Sun, Min and Kuo, Cheng-Hao and Soleymani, Mohammad},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72989-8_1},
url = {https://mlanthology.org/eccv/2024/tran2024eccv-ex2egmae/}
}