Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts

Abstract

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

Cite

Text

Changpinyo et al. "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00356

Markdown

[Changpinyo et al. "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/changpinyo2021cvpr-conceptual/) doi:10.1109/CVPR46437.2021.00356

BibTeX

@inproceedings{changpinyo2021cvpr-conceptual,
  title     = {{Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts}},
  author    = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {3558-3568},
  doi       = {10.1109/CVPR46437.2021.00356},
  url       = {https://mlanthology.org/cvpr/2021/changpinyo2021cvpr-conceptual/}
}