Vox Populi: Collecting High-Quality Labels from a Crowd

Abstract

With the emergence of search engines and crowdsourcing websites, machine learning practitioners are faced with datasets that are labeled by a large heterogeneous set of teachers. These datasets test the limits of our existing learning theory, which largely assumes that data is sampled i.i.d. from a fixed distribution. In many cases, the number of teachers actually scales with the number of examples, with each teacher providing just a handful of labels, precluding any statistically reliable assessment of an individual teacher's quality. In this paper, we study the problem of pruning low-quality teachers in a crowd, in order to improve the label quality of our training set. Despite the hurdles mentioned above, we show that this is in fact achievable with a simple and efficient algorithm, which does not require that each example be repeatedly labeled by multiple teachers. We provide a theoretical analysis of our algorithm and back our findings with empirical evidence.

Cite

Text

Dekel and Shamir. "Vox Populi: Collecting High-Quality Labels from a Crowd." Annual Conference on Computational Learning Theory, 2009.

Markdown

[Dekel and Shamir. "Vox Populi: Collecting High-Quality Labels from a Crowd." Annual Conference on Computational Learning Theory, 2009.](https://mlanthology.org/colt/2009/dekel2009colt-vox/)

BibTeX

@inproceedings{dekel2009colt-vox,
  title     = {{Vox Populi: Collecting High-Quality Labels from a Crowd}},
  author    = {Dekel, Ofer and Shamir, Ohad},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2009},
  url       = {https://mlanthology.org/colt/2009/dekel2009colt-vox/}
}