OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Abstract
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and their collection process has not been fully specified. We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision-and-language models of 9 and 80 billion parameters, IDEFICS-9B and IDEFICS, on the dataset and obtain competitive performance on different multimodal benchmarks. We release our dataset, models, and code.
Cite
Text
Laurençon et al. "OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents." Neural Information Processing Systems, 2023.
Markdown
[Laurençon et al. "OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/laurencon2023neurips-obelics/)
BibTeX
@inproceedings{laurencon2023neurips-obelics,
title = {{OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents}},
author = {Laurençon, Hugo and Saulnier, Lucile and Tronchon, Leo and Bekman, Stas and Singh, Amanpreet and Lozhkov, Anton and Wang, Thomas and Karamcheti, Siddharth and Rush, Alexander and Kiela, Douwe and Cord, Matthieu and Sanh, Victor},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/laurencon2023neurips-obelics/}
}