Training Subset Selection for Weak Supervision

Abstract

Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug into existing weak supervision pipelines, requiring just a few lines of code. We show that our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
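To make the idea concrete, below is a minimal sketch of cut-statistic subset selection, not the authors' implementation. It assumes a kNN graph with uniform edge weights over pretrained embeddings and a simplified per-example binomial null model for standardizing the cut count; the function name cut_statistic_subset and all parameter choices (k, keep_frac) are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def cut_statistic_subset(embeddings, weak_labels, k=20, keep_frac=0.5):
    """Rank weakly-labeled examples by a cut statistic and keep the
    smoothest fraction. A low score means an example's neighbors in
    embedding space mostly share its weak label."""
    weak_labels = np.asarray(weak_labels)
    n = len(weak_labels)

    # kNN graph over the pretrained representations (k + 1 neighbors
    # because each point is returned as its own nearest neighbor).
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nbrs.kneighbors(embeddings)
    neighbor_labels = weak_labels[idx[:, 1:]]  # drop self-edges

    # Observed number of "cut" edges: neighbors with a different weak label.
    cut = (neighbor_labels != weak_labels[:, None]).sum(axis=1)

    # Simplified null model (an assumption of this sketch): each neighbor's
    # label is drawn i.i.d. from the empirical class frequencies, so the cut
    # count is Binomial(k, 1 - p_y) for an example with weak label y.
    classes, counts = np.unique(weak_labels, return_counts=True)
    freq = dict(zip(classes, counts / n))
    p_diff = np.array([1.0 - freq[y] for y in weak_labels])
    mean = k * p_diff
    std = np.sqrt(k * p_diff * (1.0 - p_diff)) + 1e-12
    z = (cut - mean) / std

    # Keep the examples whose neighborhoods agree most with their labels.
    n_keep = int(keep_frac * n)
    return np.argsort(z)[:n_keep]

Plugging this into a weak supervision pipeline then amounts to a couple of lines: compute embeddings for the weakly-covered examples, call cut_statistic_subset, and train the end classifier only on the returned indices and their weak labels.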

Cite

Text

Lang et al. "Training Subset Selection for Weak Supervision." Neural Information Processing Systems, 2022.

Markdown

[Lang et al. "Training Subset Selection for Weak Supervision." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/lang2022neurips-training/)

BibTeX

@inproceedings{lang2022neurips-training,
  title     = {{Training Subset Selection for Weak Supervision}},
  author    = {Lang, Hunter and Vijayaraghavan, Aravindan and Sontag, David},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/lang2022neurips-training/}
}