Contrastive Vision-Language Pre-Training with Limited Resources
Abstract
Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing and further exploring them. To this end, we propose a stack of novel methods that significantly cuts down this heavy resource dependency and allows us to conduct dual-encoder multi-modal representation alignment with limited resources. Furthermore, we provide a reproducible baseline with competitive results, namely ZeroVL, trained with only 14M samples from publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training and achieve results comparable or superior to state-of-the-art methods, further demonstrating the effectiveness of our methods on large-scale data. We hope that this work will provide useful data points and experience for future research in contrastive vision-language pre-training. Code is available at https://github.com/zerovl/ZeroVL.
Cite
Text
Cui et al. "Contrastive Vision-Language Pre-Training with Limited Resources." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20059-5_14

Markdown
[Cui et al. "Contrastive Vision-Language Pre-Training with Limited Resources." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/cui2022eccv-contrastive/) doi:10.1007/978-3-031-20059-5_14

BibTeX
@inproceedings{cui2022eccv-contrastive,
title = {{Contrastive Vision-Language Pre-Training with Limited Resources}},
author = {Cui, Quan and Zhou, Boyan and Guo, Yu and Yin, Weidong and Wu, Hao and Yoshie, Osamu and Chen, Yubo},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-20059-5_14},
url = {https://mlanthology.org/eccv/2022/cui2022eccv-contrastive/}
}