ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
Abstract
We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE pools the local features learned under an MAE reconstruction loss into a global representation and trains that representation with a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When trained on videos and images from diverse datasets, our method maintains balanced transfer-learning performance between video and image classification benchmarks, coming in as a close second to the best supervised method.
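The combination described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear "encoder", the all-zero "decoder" output, mean pooling as the pooling operator, the InfoNCE form of the contrastive objective, the 0.75 mask ratio, and the `contrastive_weight` value are all assumptions made for illustration.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches kept visible; the rest are masked."""
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    return np.sort(rng.permutation(num_patches)[:num_keep])

def mean_pool(tokens):
    """Pool local features (batch, patches, dim) into one global vector per sample."""
    return tokens.mean(axis=1)

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss: row i of z_a is the positive for row i of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def reconstruction_loss(pred, target, masked_idx):
    """Mean squared error computed on the masked patches only."""
    diff = pred[:, masked_idx] - target[:, masked_idx]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
batch, patches, dim = 4, 16, 8

# Two views per sample: a frame and a slightly perturbed "nearby" frame.
frame_a = rng.normal(size=(batch, patches, dim))
frame_b = frame_a + 0.05 * rng.normal(size=(batch, patches, dim))

visible = random_mask(patches, mask_ratio=0.75, rng=rng)
masked = np.setdiff1d(np.arange(patches), visible)

# Hypothetical stand-in encoder: a single random linear map on visible patches.
w_enc = rng.normal(size=(dim, dim)) / np.sqrt(dim)
feat_a = frame_a[:, visible] @ w_enc
feat_b = frame_b[:, visible] @ w_enc

# Global representations obtained by pooling the local features.
z_a, z_b = mean_pool(feat_a), mean_pool(feat_b)
l_con = info_nce(z_a, z_b)

# Hypothetical stand-in decoder output (all zeros) for the reconstruction term.
pred = np.zeros_like(frame_a)
l_rec = reconstruction_loss(pred, frame_a, masked)

contrastive_weight = 0.025  # assumed value, not taken from the abstract
total = l_rec + contrastive_weight * l_con
print(f"reconstruction={l_rec:.3f} contrastive={l_con:.3f} total={total:.3f}")
```

In a real model the encoder and decoder would be Vision Transformers trained by gradient descent on `total`; the sketch only shows how the pooled global vector feeds the contrastive term while the masked patches feed the reconstruction term.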
Cite
Text
Hernandez et al. "ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73235-5_25
Markdown
[Hernandez et al. "ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/hernandez2024eccv-vicmae/) doi:10.1007/978-3-031-73235-5_25
BibTeX
@inproceedings{hernandez2024eccv-vicmae,
title = {{ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders}},
author = {Hernandez, Jefferson and Villegas, Ruben and Ordonez, Vicente},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73235-5_25},
url = {https://mlanthology.org/eccv/2024/hernandez2024eccv-vicmae/}
}