SmallToLarge (S2L): Scalable Data Selection for Fine-Tuning Large Language Models by Summarizing Training Trajectories of Small Models

Abstract

Despite the effectiveness of data selection for pretraining and instruction fine-tuning large language models (LLMs), improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which trains a small model, clusters the loss trajectories of the examples, and samples from these clusters to guide data selection for larger models. We prove that during fine-tuning, samples within the same loss trajectory cluster exhibit similar gradients. Then, we show that S2L subsets have a bounded gradient error w.r.t. the full data, and hence guarantee convergence to a neighborhood of the optimal solution. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data requirement to just $11$% of the original MathInstruct dataset to match full-dataset performance while outperforming state-of-the-art data selection algorithms by an average of $4.7$% across $6$ in- and out-of-domain evaluation datasets. Remarkably, selecting only 50K examples for SFT, S2L achieves $32.7$% accuracy on the challenging MATH benchmark, improving Phi-2 by $16.6$%. In clinical text summarization on the MIMIC-III dataset, S2L again outperforms training on the full dataset using only $50$% of the data. Notably, S2L can perform scalable data selection using a reference model $100\times$ smaller than the target model, proportionally reducing the computational cost.
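
To make the selection step concrete, the following Python sketch clusters per-example loss trajectories (one loss value per checkpoint of a small reference model) and samples a balanced subset across clusters. It is a minimal illustration under stated assumptions: the function name s2l_select, the use of k-means, the default cluster count, and the round-robin sampling rule are illustrative choices, not the paper's released implementation.

# Minimal sketch of the S2L selection step. Assumes loss trajectories from a
# small reference model have already been recorded for every training example.
# Names and defaults are illustrative, not from the paper's released code.
import numpy as np
from sklearn.cluster import KMeans


def s2l_select(loss_trajectories, budget, n_clusters=100, seed=0):
    """Cluster examples by their loss trajectories, then sample across clusters.

    loss_trajectories: array of shape (n_examples, n_checkpoints), losses
        recorded from the small reference model during its fine-tuning.
    budget: number of examples to select for fine-tuning the larger model.
    Returns an array of selected example indices.
    """
    rng = np.random.default_rng(seed)

    # Cluster the loss trajectories (examples with similar trajectories are
    # grouped together).
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        loss_trajectories
    )

    # Shuffle the member indices within each cluster.
    clusters = [rng.permutation(np.where(labels == c)[0]) for c in range(n_clusters)]

    # Round-robin over clusters so that small clusters are represented in the
    # subset rather than being drowned out by large ones.
    selected, position = [], 0
    while len(selected) < budget:
        progressed = False
        for members in clusters:
            if position < len(members):
                selected.append(members[position])
                progressed = True
                if len(selected) == budget:
                    break
        if not progressed:  # every cluster exhausted before the budget was met
            break
        position += 1
    return np.array(selected)


if __name__ == "__main__":
    # Toy example: 1,000 examples with 10 recorded checkpoints each.
    trajectories = np.random.default_rng(0).normal(size=(1000, 10))
    subset = s2l_select(trajectories, budget=100, n_clusters=20)
    print(subset.shape)  # (100,)

The selected indices would then be used to build the SFT subset for the target model; the small reference model that produced the trajectories can be far smaller than the target, which is where the reported cost savings come from.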

Cite

Text

Yang et al. "SmallToLarge (S2L): Scalable Data Selection for Fine-Tuning Large Language Models by Summarizing Training Trajectories of Small Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-2655

Markdown

[Yang et al. "SmallToLarge (S2L): Scalable Data Selection for Fine-Tuning Large Language Models by Summarizing Training Trajectories of Small Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/yang2024neurips-smalltolarge/) doi:10.52202/079017-2655

BibTeX

@inproceedings{yang2024neurips-smalltolarge,
  title     = {{SmallToLarge (S2L): Scalable Data Selection for Fine-Tuning Large Language Models by Summarizing Training Trajectories of Small Models}},
  author    = {Yang, Yu and Mishra, Siddhartha and Chiang, Jeffrey and Mirzasoleiman, Baharan},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2655},
  url       = {https://mlanthology.org/neurips/2024/yang2024neurips-smalltolarge/}
}