Perplexed by Perplexity: Perplexity-Based Data Pruning with Small Reference Models

Abstract

In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we examine whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that, for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter reference model improves the average downstream task performance of a 3 billion parameter model by up to 2.04 points and achieves up to a 1.45× reduction in the pretraining steps needed to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
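As a concrete illustration of the method described above, the sketch below scores each pretraining example with a small reference model and keeps only the examples whose perplexity falls in a chosen quantile window. This is a minimal sketch, not the paper's exact pipeline: the checkpoint name (EleutherAI/pythia-125m, a stand-in of roughly the right size), the quantile bounds, and the helper functions prune_by_perplexity and perplexity are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative ~125M-parameter reference model; not necessarily the
# checkpoint used in the paper.
model_name = "EleutherAI/pythia-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    # Token-level perplexity: exp of the mean next-token cross-entropy.
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    loss = model(ids, labels=ids).loss  # labels are shifted internally
    return loss.exp().item()

def prune_by_perplexity(texts, lo_q=0.25, hi_q=0.75):
    # Score every example with the small reference model, then keep those
    # whose perplexity falls between the chosen quantiles (here the middle
    # 50%, an illustrative setting).
    scores = torch.tensor([perplexity(t) for t in texts])
    lo = torch.quantile(scores, lo_q)
    hi = torch.quantile(scores, hi_q)
    return [t for t, s in zip(texts, scores) if lo <= s <= hi]

Whether a low, middle, or high perplexity window yields the best training data depends on the domain composition of the corpus, which is one of the questions the paper studies; the quantile bounds above would be tuned accordingly.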

Cite

Text

Ankner et al. "Perplexed by Perplexity: Perplexity-Based Data Pruning with Small Reference Models." International Conference on Learning Representations, 2025.

Markdown

[Ankner et al. "Perplexed by Perplexity: Perplexity-Based Data Pruning with Small Reference Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/ankner2025iclr-perplexed/)

BibTeX

@inproceedings{ankner2025iclr-perplexed,
  title     = {{Perplexed by Perplexity: Perplexity-Based Data Pruning with Small Reference Models}},
  author    = {Ankner, Zachary and Blakeney, Cody and Sreenivasan, Kartik and Marion, Max and Leavitt, Matthew L. and Paul, Mansheej},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/ankner2025iclr-perplexed/}
}