GneissWeb: Preparing High Quality Data for LLMs at Scale
Abstract
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost an LLM's ability to generalize across a wide range of downstream tasks. In this paper, we introduce **GneissWeb**, a large dataset of around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. The GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb goes beyond the simple model-based quality filtering used in recent datasets by designing an ensemble of filters that incorporates novel quality filters. These novel components enable us to achieve a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using GneissWeb outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in average score on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage point gain over those trained on FineWeb-V1.1.0.
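As a rough illustration of the two stages named in the abstract, the sketch below shows a toy version of sharded exact sub-string deduplication followed by an ensemble of quality filters. Everything in it is an assumption made for illustration: the function names, the character-window matching, the two heuristic filters, and the vote-counting rule are hypothetical and are not taken from the paper, which does not specify these details in the abstract. Production pipelines typically use far more scalable machinery for sub-string matching than the in-memory set used here.

```python
from typing import Callable, List, Sequence

Filter = Callable[[str], bool]  # hypothetical filter signature: True means "keep"


def dedup_shard(docs: Sequence[str], window: int = 50) -> List[str]:
    """Toy exact sub-string dedup within one shard.

    Any `window`-character span already seen earlier in the shard is cut.
    """
    seen = set()
    cleaned = []
    for doc in docs:
        kept = []
        i = 0
        while i + window <= len(doc):
            chunk = doc[i:i + window]
            if chunk in seen:
                i += window          # drop the repeated span
            else:
                seen.add(chunk)
                kept.append(doc[i])  # keep one character and slide on
                i += 1
        kept.append(doc[i:])         # tail shorter than one window
        cleaned.append("".join(kept))
    return cleaned


def sharded_dedup(docs: Sequence[str], n_shards: int, window: int = 50) -> List[str]:
    """Split documents into shards and deduplicate each shard independently."""
    shards = [docs[i::n_shards] for i in range(n_shards)]
    return [d for shard in shards for d in dedup_shard(shard, window) if d]


def long_enough(doc: str, min_words: int = 3) -> bool:
    """Hypothetical filter: require a minimum word count."""
    return len(doc.split()) >= min_words


def mostly_alpha(doc: str, threshold: float = 0.8) -> bool:
    """Hypothetical filter: require mostly letters and spaces."""
    if not doc:
        return False
    good = sum(c.isalpha() or c.isspace() for c in doc)
    return good / len(doc) >= threshold


def ensemble_keep(doc: str, filters: Sequence[Filter], min_votes: int) -> bool:
    """Assumed combination rule: keep a document if enough filters accept it."""
    return sum(f(doc) for f in filters) >= min_votes
```

Because each shard keeps its own set of seen spans, a duplicate is removed only when both copies land in the same shard; this trades some recall for the ability to process shards independently.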
Cite
Text
Gohari et al. "GneissWeb: Preparing High Quality Data for LLMs at Scale." International Conference on Learning Representations, 2026.
Markdown
[Gohari et al. "GneissWeb: Preparing High Quality Data for LLMs at Scale." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/gohari2026iclr-gneissweb/)
BibTeX
@inproceedings{gohari2026iclr-gneissweb,
title = {{GneissWeb: Preparing High Quality Data for LLMs at Scale}},
author = {Gohari, Hajar Emami and Kadhe, Swanand Ravindra and Shah, Yousaf and Adam, Constantin M and Adebayo, Abdulhamid and Adusumilli, Praneet and Ahmed, Farhan and Baracaldo, Nathalie and Borse, Santosh Subhashrao and Chang, Yuan-Chi and Dang, Xuan-Hong and Desai, Nirmit and Eres, Revital and Iwamoto, Ran and Karve, Alexei A. and Koyfman, Yan and Lee, Wei-Han and Liu, Changchang and Lublinsky, Boris and Ohko, Takuya and Pesce, Pablo and Touma, Maroun and Wang, Shiqiang and Witherspoon, Shalisha and Woisetschläger, Herbert and Wood, David and Wu, Kun-Lung and Yoshida, Issei and Zawad, Syed and Zerfos, Petros and Zhou, Yi and Bhattacharjee, Bishwaranjan},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/gohari2026iclr-gneissweb/}
}