SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training

Abstract

Video-language pre-training is crucial for learning powerful multi-modal representations, but it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundational component of SMAUG is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy considers both the visual and textual modalities, yielding better cross-modal alignment and larger pre-training savings. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only "important" spatial regions and temporal frames for pre-training. Coupling all these designs lets our method achieve competitive performance on text-to-video retrieval and video question answering while reducing pre-training cost by 1.9x or more. For example, SMAUG needs only 50 NVIDIA A6000 GPU hours of pre-training to attain competitive performance on these two video-language tasks across six popular benchmarks.
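The abstract only outlines the space-time token sparsification module, so the following is a minimal PyTorch sketch of the general idea: keep the highest-scoring patch tokens within each frame, then keep the highest-scoring frames. Everything here (the SpaceTimeSparsifier name, the keep ratios, and the use of attention-derived importance scores) is an assumption for illustration, not the authors' implementation.

# Illustrative sketch only, NOT the SMAUG implementation. Names, keep ratios,
# and the attention-based scores are assumptions made for this example.
import torch
import torch.nn as nn

class SpaceTimeSparsifier(nn.Module):
    """Keeps the top-scoring patch tokens per frame, then the top-scoring
    frames, approximating the idea of selecting "important" spatial regions
    and temporal frames."""

    def __init__(self, spatial_keep: float = 0.5, temporal_keep: float = 0.5):
        super().__init__()
        self.spatial_keep = spatial_keep
        self.temporal_keep = temporal_keep

    def forward(self, tokens: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, N, D) patch tokens for T frames of N patches each.
        # scores: (B, T, N) importance scores, e.g. [CLS]-to-patch attention.
        B, T, N, D = tokens.shape
        k_space = max(1, int(N * self.spatial_keep))
        k_time = max(1, int(T * self.temporal_keep))

        # 1) Spatial selection: top-k patches within each frame.
        top_vals, top_idx = scores.topk(k_space, dim=-1)          # (B, T, k_space)
        tokens = torch.gather(
            tokens, 2, top_idx.unsqueeze(-1).expand(-1, -1, -1, D))

        # 2) Temporal selection: top-k frames by summed kept-patch scores.
        frame_scores = top_vals.sum(-1)                           # (B, T)
        t_idx = frame_scores.topk(k_time, dim=-1).indices         # (B, k_time)
        tokens = torch.gather(
            tokens, 1, t_idx[:, :, None, None].expand(-1, -1, k_space, D))
        return tokens                                             # (B, k_time, k_space, D)

# Usage: 8 frames x 196 patches -> 4 frames x 98 patches.
sparsifier = SpaceTimeSparsifier()
x = torch.randn(2, 8, 196, 768)
s = torch.rand(2, 8, 196)
print(sparsifier(x, s).shape)  # torch.Size([2, 4, 98, 768])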

Cite

Text

Lin et al. "SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00233

Markdown

[Lin et al. "SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/lin2023iccv-smaug/) doi:10.1109/ICCV51070.2023.00233

BibTeX

@inproceedings{lin2023iccv-smaug,
  title     = {{SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training}},
  author    = {Lin, Yuanze and Wei, Chen and Wang, Huiyu and Yuille, Alan and Xie, Cihang},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {2459--2469},
  doi       = {10.1109/ICCV51070.2023.00233},
  url       = {https://mlanthology.org/iccv/2023/lin2023iccv-smaug/}
}