Data Efficient Pre-Training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence
Abstract
Training large language models is compute- and data-intensive, limiting optimisation and low-resource training, and increasing environmental impact. This paper examines the pre-training effectiveness of language models of different sizes on two small, curated datasets and evaluates (i) linguistic competence and (ii) compute efficiency. The datasets are TinyStories, a collection of ChatGPT-generated children's stories, and BabyLM, a small, open-domain dataset. We perform experiments with increasing amounts of data (yielding a learning curve) and with size variants of a Llama-based, decoder-only architecture. We evaluate the pre-trained models on downstream tasks from the BLiMP and GLUE benchmark suites. We find that models trained on BabyLM outperform those trained on TinyStories on formal linguistic competence, but not on functional linguistic tasks. Models pre-trained on BabyLM also yield more consistent results, as indicated by lower variance across random seeds. We further find that performance on small data samples is representative of a model's ultimate performance, which can aid the early selection of promising candidate models. These findings emphasise the potential of pre-training on small, curated datasets for data-efficient pre-training in resource-constrained settings. Further work that includes additional datasets and model architectures is needed to extend the scope of these findings.
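The sketch below is not the authors' code; it is a minimal illustration of the kind of setup the abstract describes: pre-training a small, Llama-style, decoder-only model on TinyStories with Hugging Face Transformers. The dataset ID, tokenizer choice, model dimensions, and training hyperparameters are assumptions made for illustration, not the configuration used in the paper.

```python
# Minimal sketch (not the paper's implementation): pre-train a small
# Llama-style decoder-only model on TinyStories. All hyperparameters
# below are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaConfig,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer choice
tokenizer.pad_token = tokenizer.eos_token

# TinyStories as hosted on the Hugging Face Hub (assumed dataset ID);
# a small slice stands in for the "increasing amounts of data" setting.
dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# A deliberately small decoder-only configuration; varying hidden_size and
# num_hidden_layers yields the kind of size variants the abstract mentions.
config = LlamaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=512,
)
model = LlamaForCausalLM(config)

args = TrainingArguments(
    output_dir="tinystories-llama-sketch",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    learning_rate=3e-4,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The resulting checkpoints could then be scored on formal and functional linguistic benchmarks such as BLiMP and GLUE, for example via an external evaluation harness; the paper's exact evaluation pipeline is not shown here.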
Cite
Text
Paraskeva et al. "Data Efficient Pre-Training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence." ICLR 2025 Workshops: Data_Problems, 2025.

Markdown

[Paraskeva et al. "Data Efficient Pre-Training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence." ICLR 2025 Workshops: Data_Problems, 2025.](https://mlanthology.org/iclrw/2025/paraskeva2025iclrw-data/)

BibTeX
@inproceedings{paraskeva2025iclrw-data,
title = {{Data Efficient Pre-Training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence}},
author = {Paraskeva, Andreas and van Duijn, Max Johannes and de Rijke, Maarten and Verberne, Suzan and van Rijn, Jan N.},
booktitle = {ICLR 2025 Workshops: Data_Problems},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/paraskeva2025iclrw-data/}
}