Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Abstract

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of “typical” $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
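To make the selection procedure concrete, below is a minimal Python sketch, not the paper's verbatim algorithm. It assumes an `(n, d)` array `embeddings` of per-example embeddings and uses a standard sensitivity-style distribution for $k$-means: each point is sampled with probability proportional to its own clustering cost plus a uniform share of its cluster's cost, and importance weights make the weighted average loss unbiased. The names `sensitivity_sample` and `num_samples` are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(embeddings, k, num_samples, seed=0):
    """Sketch of clustering-based sensitivity sampling (assumed variant)."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    # Squared distance of each point to its nearest center (its k-means cost).
    costs = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1) ** 2
    # Per-cluster total cost and size, to spread mass uniformly within clusters.
    cluster_cost = np.bincount(km.labels_, weights=costs, minlength=k)
    cluster_size = np.bincount(km.labels_, minlength=k)
    # Sensitivity-style score: own cost plus an equal share of the cluster cost.
    scores = costs + cluster_cost[km.labels_] / cluster_size[km.labels_]
    probs = scores / scores.sum()
    idx = rng.choice(len(embeddings), size=num_samples, replace=True, p=probs)
    # Importance weights so that the weighted average of sampled losses
    # estimates the average loss over the full dataset.
    weights = 1.0 / (num_samples * probs[idx])
    return idx, weights

With a budget on the order of $k + 1/\varepsilon^2$ samples, averaging the model's losses on the returned indices with the returned weights gives the kind of estimate the abstract describes, under the stated Hölder-continuity assumption on the loss.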

Cite

Text

Axiotis et al. "Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond." International Conference on Machine Learning, 2024.

Markdown

[Axiotis et al. "Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/axiotis2024icml-dataefficient/)

BibTeX

@inproceedings{axiotis2024icml-dataefficient,
  title     = {{Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond}},
  author    = {Axiotis, Kyriakos and Cohen-Addad, Vincent and Henzinger, Monika and Jerome, Sammy and Mirrokni, Vahab and Saulpic, David and Woodruff, David and Wunder, Michael},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {2086--2107},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/axiotis2024icml-dataefficient/}
}