DataComp: In Search of the Next Generation of Multimodal Datasets

Abstract

Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
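One family of filtering baselines studied in DataComp ranks candidate image-text pairs by a quality score (e.g. CLIP image-text similarity) and keeps only the top fraction of the pool. The sketch below illustrates that idea on a toy pool; the `Candidate` class, field names, and the assumption that scores are precomputed are illustrative choices, not the benchmark's actual data format.

```python
# Minimal sketch of score-based pool filtering, assuming each candidate
# already carries a precomputed image-text similarity score. The data
# layout here is hypothetical, not DataComp's on-disk format.
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    caption: str
    clip_score: float  # precomputed image-text cosine similarity


def filter_top_fraction(pool, fraction=0.3):
    """Keep the top `fraction` of candidates ranked by score."""
    ranked = sorted(pool, key=lambda c: c.clip_score, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Toy candidate pool: high-score pairs have descriptive captions,
# low-score pairs are noise that filtering should discard.
pool = [
    Candidate("a.jpg", "a photo of a dog", 0.31),
    Candidate("b.jpg", "click here", 0.05),
    Candidate("c.jpg", "a red bicycle", 0.27),
    Candidate("d.jpg", "asdfgh", 0.02),
]
subset = filter_top_fraction(pool, fraction=0.5)
print([c.url for c in subset])  # → ['a.jpg', 'c.jpg']
```

In the benchmark itself the filtered subset would then be fed, unchanged, into the standardized CLIP training code and scored on the 38 downstream test sets.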

Cite

Text

Gadre et al. "DataComp: In Search of the Next Generation of Multimodal Datasets." Neural Information Processing Systems, 2023.

Markdown

[Gadre et al. "DataComp: In Search of the Next Generation of Multimodal Datasets." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/gadre2023neurips-datacomp/)

BibTeX

@inproceedings{gadre2023neurips-datacomp,
  title     = {{DataComp: In Search of the Next Generation of Multimodal Datasets}},
  author    = {Gadre, Samir Yitzhak and Ilharco, Gabriel and Fang, Alex and Hayase, Jonathan and Smyrnis, Georgios and Nguyen, Thao and Marten, Ryan and Wortsman, Mitchell and Ghosh, Dhruba and Zhang, Jieyu and Orgad, Eyal and Entezari, Rahim and Daras, Giannis and Pratt, Sarah and Ramanujan, Vivek and Bitton, Yonatan and Marathe, Kalyani and Mussmann, Stephen and Vencu, Richard and Cherti, Mehdi and Krishna, Ranjay and Koh, Pang Wei and Saukh, Olga and Ratner, Alexander J and Song, Shuran and Hajishirzi, Hannaneh and Farhadi, Ali and Beaumont, Romain and Oh, Sewoong and Dimakis, Alex and Jitsev, Jenia and Carmon, Yair and Shankar, Vaishaal and Schmidt, Ludwig},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/gadre2023neurips-datacomp/}
}