EVE: Efficient Vision-Language Pre-Training with Masked Prediction and Modality-Aware MoE

Chen, Junyi; Guo, Longteng; Sun, Jia; Shao, Shuai; Yuan, Zehuan; Lin, Liang; Zhang, Dongyu

doi:10.1609/AAAI.V38I2.27872

AAAI 2024 pp. 1110-1119

Abstract

Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 4x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
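
To make the masked signal modeling objective concrete, the following is a minimal sketch of a loss computed only at masked positions: mean squared error for image patches and cross-entropy for text tokens. It is not the authors' implementation. The function names, the random masking strategy, and the unweighted sum of the two terms are assumptions made for illustration, and the predictions are stand-ins for the output of a model.

```python
import math
import random


def random_mask(length, ratio, rng):
    """Pick positions to hide (hypothetical helper; the ratio is an assumption)."""
    n_masked = max(1, int(length * ratio))
    return sorted(rng.sample(range(length), n_masked))


def masked_pixel_loss(patches, predicted, masked):
    """Mean squared error, averaged over masked patches only."""
    total = 0.0
    for i in masked:
        sq_err = [(p - t) ** 2 for p, t in zip(predicted[i], patches[i])]
        total += sum(sq_err) / len(sq_err)
    return total / len(masked)


def masked_token_loss(token_ids, logits, masked):
    """Cross-entropy over the vocabulary, averaged over masked tokens only."""
    total = 0.0
    for i in masked:
        peak = max(logits[i])  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits[i]))
        total += log_z - logits[i][token_ids[i]]
    return total / len(masked)


def masked_signal_loss(patches, pred_patches, image_mask,
                       token_ids, logits, text_mask):
    """Sum of both reconstruction terms (equal weighting is an assumption)."""
    return (masked_pixel_loss(patches, pred_patches, image_mask)
            + masked_token_loss(token_ids, logits, text_mask))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy image: 4 patches of 2 pixel values each. Toy text: 3 tokens, vocab of 4.
    patches = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
    token_ids = [2, 0, 3]
    image_mask = random_mask(len(patches), 0.5, rng)
    text_mask = random_mask(len(token_ids), 0.4, rng)
    # Stand-in predictions: all-zero patches and uniform logits.
    pred_patches = [[0.0, 0.0] for _ in patches]
    logits = [[0.0] * 4 for _ in token_ids]
    print(masked_signal_loss(patches, pred_patches, image_mask,
                             token_ids, logits, text_mask))
```

Visible positions contribute nothing to either term, so predictions there can be arbitrary without changing the loss.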

Cite

Text

Chen et al. "EVE: Efficient Vision-Language Pre-Training with Masked Prediction and Modality-Aware MoE." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I2.27872

Markdown

[Chen et al. "EVE: Efficient Vision-Language Pre-Training with Masked Prediction and Modality-Aware MoE." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/chen2024aaai-eve/) doi:10.1609/AAAI.V38I2.27872

BibTeX

@inproceedings{chen2024aaai-eve,
  title     = {{EVE: Efficient Vision-Language Pre-Training with Masked Prediction and Modality-Aware MoE}},
  author    = {Chen, Junyi and Guo, Longteng and Sun, Jia and Shao, Shuai and Yuan, Zehuan and Lin, Liang and Zhang, Dongyu},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {1110--1119},
  doi       = {10.1609/AAAI.V38I2.27872},
  url       = {https://mlanthology.org/aaai/2024/chen2024aaai-eve/}
}