LLM Data Selection and Utilization via Dynamic Bi-Level Optimization

ICML 2025 pp. 72995-73008

Abstract

While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria and fail to account for the dynamic interaction between model training and the data. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preferences of the trained model, a bi-level optimization framework is employed to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly selected data, and that the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we analyze how a model’s data preferences evolve throughout training, providing new insights into the dynamics of data utilization during LLM training.
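As a rough illustration of the batch-level weighting and bi-level update described in the abstract, the sketch below pairs a toy stand-in for the LLM with a small scorer playing the role of the weighting model. The architectures, losses, learning rates, held-out outer objective, and the single differentiable "virtual" inner step are all assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

torch.manual_seed(0)

# Hypothetical stand-ins: a linear "model" in place of the LLM,
# and a small scorer acting as the data weighting model.
model = nn.Linear(16, 1)
weighter = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt_model = torch.optim.SGD(model.parameters(), lr=1e-2)
opt_weighter = torch.optim.Adam(weighter.parameters(), lr=1e-3)
inner_lr = 1e-2  # assumed step size for the virtual inner update

def per_example_loss(params, x, y):
    # Functional forward pass so gradients can flow through "virtual" parameters.
    pred = functional_call(model, params, (x,))
    return F.mse_loss(pred, y, reduction="none").squeeze(-1)

for step in range(200):
    x_tr, y_tr = torch.randn(32, 16), torch.randn(32, 1)    # training batch
    x_val, y_val = torch.randn(32, 16), torch.randn(32, 1)   # held-out batch for the outer objective

    # Upper level: update the weighting model through one differentiable inner step.
    params = {k: v.detach().requires_grad_(True) for k, v in model.named_parameters()}
    w = torch.softmax(weighter(x_tr).squeeze(-1), dim=0)     # per-example weights within the batch
    inner_loss = (w * per_example_loss(params, x_tr, y_tr)).sum()
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    virtual = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
    outer_loss = per_example_loss(virtual, x_val, y_val).mean()
    opt_weighter.zero_grad()
    outer_loss.backward()                                     # meta-gradient w.r.t. the weighting model
    opt_weighter.step()

    # Lower level: train the model on the weighted batch loss using the refreshed weights.
    with torch.no_grad():
        w = torch.softmax(weighter(x_tr).squeeze(-1), dim=0)
    train_loss = (w * per_example_loss(dict(model.named_parameters()), x_tr, y_tr)).sum()
    opt_model.zero_grad()
    train_loss.backward()
    opt_model.step()

The softmax over the batch makes the weights a distribution within each batch, so the scorer can only redistribute emphasis among the selected examples; this mirrors the batch-level weighting idea in the abstract, though the actual DWM design may differ.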

Cite

Text

Yu et al. "LLM Data Selection and Utilization via Dynamic Bi-Level Optimization." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Yu et al. "LLM Data Selection and Utilization via Dynamic Bi-Level Optimization." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/yu2025icml-llm/)

BibTeX

@inproceedings{yu2025icml-llm,
  title     = {{LLM Data Selection and Utilization via Dynamic Bi-Level Optimization}},
  author    = {Yu, Yang and Han, Kai and Zhou, Hang and Tang, Yehui and Huang, Kaiqi and Wang, Yunhe and Tao, Dacheng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {72995--73008},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/yu2025icml-llm/}
}