Towards a Statistical Theory of Data Selection Under Weak Supervision
Abstract
Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume we are given $N$ unlabeled samples $x_{i}$, together with access to a 'surrogate model' that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, denoted by $\{x_{i}\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and use them to train a model via regularized empirical risk minimization. Using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: $(i)$ data selection can be very effective, in particular beating training on the full sample in some cases; $(ii)$ certain popular choices in data selection methods (e.g., unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.
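The pipeline described in the abstract (score unlabeled points with a surrogate model, select a subsample, acquire labels only for the selected points, then fit a regularized ERM model) can be illustrated with a minimal sketch. The margin-based selection rule, the logistic-regression learner, and all function and variable names below are illustrative assumptions, not the paper's prescribed method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def select_and_train(X, surrogate_probs, label_oracle, n, reg_C=1.0):
    """Select n of N points using surrogate-model scores, then fit a
    regularized ERM model (L2-penalized logistic regression) on the
    labels acquired for the selected subset only.

    The scoring rule here (keep the points where the surrogate is least
    confident) is just one illustrative choice of selection scheme.
    """
    # Margin-based score: small |p - 1/2| means the surrogate is uncertain.
    margin = np.abs(surrogate_probs - 0.5)
    G = np.argsort(margin)[:n]          # indices of the selected subsample
    y_G = label_oracle(G)               # labels are acquired only for G
    model = LogisticRegression(C=reg_C) # ridge-regularized ERM
    model.fit(X[G], y_G)
    return model, G


# Toy usage on synthetic data; the surrogate is a noisy version of the truth.
N, d, n = 5000, 20, 500
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))
theta = rng.standard_normal(d)
y = (X @ theta + 0.5 * rng.standard_normal(N) > 0).astype(int)
surrogate_probs = 1.0 / (1.0 + np.exp(-(X @ theta + 2.0 * rng.standard_normal(N))))
model, G = select_and_train(X, surrogate_probs, lambda idx: y[idx], n)
```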
Cite
Text
Kolossov et al. "Towards a Statistical Theory of Data Selection Under Weak Supervision." International Conference on Learning Representations, 2024.
Markdown
[Kolossov et al. "Towards a Statistical Theory of Data Selection Under Weak Supervision." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/kolossov2024iclr-statistical/)
BibTeX
@inproceedings{kolossov2024iclr-statistical,
  title = {{Towards a Statistical Theory of Data Selection Under Weak Supervision}},
  author = {Kolossov, Germain and Montanari, Andrea and Tandon, Pulkit},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  url = {https://mlanthology.org/iclr/2024/kolossov2024iclr-statistical/}
}