Semi-Supervised Aggregation of Dependent Weak Supervision Sources with Performance Guarantees

Abstract

We develop a novel method that provides theoretical guarantees for learning from weak labelers without the (often unrealistic) assumption that the labelers' errors are independent or drawn from a particular family of distributions. We present a rigorous technique for efficiently selecting small subsets of the labelers such that a majority vote over such a subset has a provably low error rate. We explore several extensions of this method and provide experimental results over a range of labeled data set sizes on 45 image classification tasks. Our performance-guaranteed methods consistently match the best performing alternative, which varies based on problem difficulty. On tasks with accurate weak labelers, our methods are on average 3 percentage points more accurate than the state-of-the-art adversarial method. On tasks with inaccurate weak labelers, our methods are on average 15 percentage points more accurate than the semi-supervised Dawid-Skene model (which assumes independence).
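The core idea above can be illustrated with a minimal sketch: use a small labeled set to estimate the empirical error of the majority vote over every size-`k` subset of weak labelers, and select the best-scoring subset. Note this is a simplified, hypothetical stand-in, not the paper's actual selection procedure, which adds performance guarantees on top of such empirical estimates; the function names (`majority_vote`, `select_subset`) are illustrative.

```python
import itertools
import numpy as np

def majority_vote(votes):
    """Majority vote over binary (+1/-1) labeler outputs.

    votes: array of shape (n_examples, n_labelers); each column is one
    weak labeler's predictions. Returns the sign of the column sum.
    """
    return np.sign(votes.sum(axis=1))

def select_subset(votes_labeled, y_labeled, k=3):
    """Hypothetical subset selection: exhaustively score every odd-sized
    subset of k labelers by the empirical error of its majority vote on a
    small labeled set, and return the best one.

    This is a brute-force simplification for illustration only; the paper's
    method selects subsets efficiently and with provable error bounds.
    """
    n_labelers = votes_labeled.shape[1]
    best_subset, best_err = None, np.inf
    for subset in itertools.combinations(range(n_labelers), k):
        pred = majority_vote(votes_labeled[:, subset])
        err = np.mean(pred != y_labeled)
        if err < best_err:
            best_subset, best_err = subset, err
    return best_subset, best_err
```

Because errors across labelers may be dependent, scoring the joint majority vote directly (rather than combining per-labeler accuracy estimates, as independence-based models like Dawid-Skene do) is what makes this style of selection robust to correlated mistakes.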

Cite

Text

Mazzetto et al. "Semi-Supervised Aggregation of Dependent Weak Supervision Sources with Performance Guarantees." Artificial Intelligence and Statistics, 2021.

Markdown

[Mazzetto et al. "Semi-Supervised Aggregation of Dependent Weak Supervision Sources with Performance Guarantees." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/mazzetto2021aistats-semisupervised/)

BibTeX

@inproceedings{mazzetto2021aistats-semisupervised,
  title     = {{Semi-Supervised Aggregation of Dependent Weak Supervision Sources with Performance Guarantees}},
  author    = {Mazzetto, Alessio and Sam, Dylan and Park, Andrew and Upfal, Eli and Bach, Stephen},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2021},
  pages     = {3196--3204},
  volume    = {130},
  url       = {https://mlanthology.org/aistats/2021/mazzetto2021aistats-semisupervised/}
}