Vision Learners Meet Web Image-Text Pairs
Abstract
Most recent self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy, web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretical view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.
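For readers unfamiliar with the "image-text contrastive training" family of objectives compared in the benchmark, the sketch below shows a generic CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings. It is only an illustrative assumption of what such an objective looks like, not the paper's MUG method or its benchmark implementation; the function name, embedding sizes, and temperature value are hypothetical.

```python
# Minimal sketch of a generic image-text contrastive (CLIP-style) loss.
# Illustrative only; not the MUG objective from the paper.
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; matched pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    img = torch.randn(8, 256)
    txt = torch.randn(8, 256)
    print(image_text_contrastive_loss(img, txt))
```

In this formulation, each image in the batch treats its paired caption as the positive and all other captions as negatives, which is what distinguishes these multi-modal objectives from the single-modal masked training objectives the paper benchmarks against.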
Cite
Text
Zhao et al. "Vision Learners Meet Web Image-Text Pairs." Transactions on Machine Learning Research, 2024.
Markdown
[Zhao et al. "Vision Learners Meet Web Image-Text Pairs." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/zhao2024tmlr-vision/)
BibTeX
@article{zhao2024tmlr-vision,
  title   = {{Vision Learners Meet Web Image-Text Pairs}},
  author  = {Zhao, Bingchen and Cui, Quan and Wu, Hao and Yoshie, Osamu and Yang, Cheng and Aodha, Oisin Mac},
  journal = {Transactions on Machine Learning Research},
  year    = {2024},
  url     = {https://mlanthology.org/tmlr/2024/zhao2024tmlr-vision/}
}