Multimodal C4: An Open, Billion-Scale Corpus of Images Interleaved with Text

Abstract

In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. Multimodal C4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (88%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (80%). After filtering NSFW images, ads, etc., the resulting corpus consists of 101.2M documents with 571M images interleaved in 43B English tokens.
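
The image placement step described above can be illustrated with a minimal sketch. This is not the authors' released code: the function name `assign_images_to_sentences` is hypothetical, the small hand-written vectors stand in for real CLIP image and sentence embeddings, and SciPy's `linear_sum_assignment` is assumed as the solver. The sketch matches each image to a distinct sentence so that total cosine similarity is maximized.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_images_to_sentences(image_feats, sentence_feats):
    """Map each image index to a distinct sentence index.

    Hypothetical helper: inputs are (n_images, d) and (n_sentences, d)
    arrays that would, in practice, hold CLIP embeddings.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = sentence_feats / np.linalg.norm(sentence_feats, axis=1, keepdims=True)
    similarity = img @ txt.T  # shape: (n_images, n_sentences)

    # Solve the assignment problem, maximizing total similarity.
    # Rectangular matrices are allowed; the smaller side is fully matched.
    rows, cols = linear_sum_assignment(similarity, maximize=True)
    return {int(r): int(c) for r, c in zip(rows, cols)}


# Toy stand-ins for embeddings (assumption: real features come from CLIP).
sentence_feats = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_feats = np.array([
    [0.9, 0.1, 0.0],  # closest to sentence 0
    [0.8, 0.0, 0.6],  # also closest to sentence 0, second-closest to sentence 2
])

assignment = assign_images_to_sentences(image_feats, sentence_feats)
print(assignment)
```

In this toy case both images are individually most similar to sentence 0, so a per-image argmax would stack them on the same sentence. The assignment formulation instead spreads images across distinct sentences while keeping the total similarity as high as possible.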

Cite

Text

Zhu et al. "Multimodal C4: An Open, Billion-Scale Corpus of Images Interleaved with Text." Neural Information Processing Systems, 2023.

Markdown

[Zhu et al. "Multimodal C4: An Open, Billion-Scale Corpus of Images Interleaved with Text." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/zhu2023neurips-multimodal/)

BibTeX

@inproceedings{zhu2023neurips-multimodal,
  title     = {{Multimodal C4: An Open, Billion-Scale Corpus of Images Interleaved with Text}},
  author    = {Zhu, Wanrong and Hessel, Jack and Awadalla, Anas and Gadre, Samir Yitzhak and Dodge, Jesse and Fang, Alex and Yu, Youngjae and Schmidt, Ludwig and Wang, William Yang and Choi, Yejin},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/zhu2023neurips-multimodal/}
}