Data Selection via Optimal Control for Language Models
Abstract
This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce **P**MP-based **D**ata **S**election (**PDS**), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which helps mitigate the quick exhaustion of available web-crawled corpora. Our code, model, and data can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
Cite
Text
Gu et al. "Data Selection via Optimal Control for Language Models." International Conference on Learning Representations, 2025.
Markdown
[Gu et al. "Data Selection via Optimal Control for Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/gu2025iclr-data/)
BibTeX
@inproceedings{gu2025iclr-data,
title = {{Data Selection via Optimal Control for Language Models}},
author = {Gu, Yuxian and Dong, Li and Wang, Hongning and Hao, Yaru and Dong, Qingxiu and Wei, Furu and Huang, Minlie},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/gu2025iclr-data/}
}