OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Abstract

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and helps maintain the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale, high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, as it can easily be reduced from an image-text interleaved format to a pure text corpus or to image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research.
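The flexibility claim above, that an interleaved document can be reduced to a pure text corpus or to image-text pairs, can be illustrated with a minimal sketch. The segment-based record schema and the nearest-preceding-text pairing heuristic below are illustrative assumptions, not the dataset's actual format or pipeline:

```python
# Hypothetical interleaved-document record: an ordered list of text and
# image segments (an assumed schema, not OmniCorpus's official one).
doc = {
    "segments": [
        {"type": "text", "text": "A cat sits on a mat."},
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "It looks content."},
    ]
}

def to_pure_text(doc):
    """Degrade to a pure text corpus: drop images, concatenate the text."""
    return " ".join(s["text"] for s in doc["segments"] if s["type"] == "text")

def to_image_text_pairs(doc):
    """Degrade to image-text pairs: pair each image with the nearest
    preceding text segment (one simple heuristic among many possible)."""
    pairs, last_text = [], ""
    for s in doc["segments"]:
        if s["type"] == "text":
            last_text = s["text"]
        elif s["type"] == "image" and last_text:
            pairs.append((s["url"], last_text))
    return pairs

print(to_pure_text(doc))        # the document's text only
print(to_image_text_pairs(doc)) # one (url, caption-like text) pair
```

Because both reductions are simple projections of the same record, a corpus stored in interleaved form subsumes the two conventional formats, which is the flexibility the abstract refers to.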

Cite

Text

Li et al. "OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text." International Conference on Learning Representations, 2025.

Markdown

[Li et al. "OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/li2025iclr-omnicorpus/)

BibTeX

@inproceedings{li2025iclr-omnicorpus,
  title     = {{OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text}},
  author    = {Li, Qingyun and Chen, Zhe and Wang, Weiyun and Wang, Wenhai and Ye, Shenglong and Jin, Zhenjiang and Chen, Guanzhou and He, Yinan and Gao, Zhangwei and Cui, Erfei and Yu, Jiashuo and Tian, Hao and Zhou, Jiasheng and Xu, Chao and Wang, Bin and Wei, Xingjian and Li, Wei and Zhang, Wenjian and Zhang, Bo and Cai, Pinlong and Wen, Licheng and Yan, Xiangchao and Chu, Pei and Wang, Yi and Dou, Min and Tian, Changyao and Zhu, Xizhou and Lu, Lewei and Chen, Yushi and He, Junjun and Lu, Tong and Wang, Yali and Wang, Limin and Lin, Dahua and Qiao, Yu and Shi, Botian and He, Conghui and Dai, Jifeng},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/li2025iclr-omnicorpus/}
}