Does Unlabeled Data Provably Help? Worst-Case Analysis of the Sample Complexity of Semi-Supervised Learning
Abstract
We study the potential benefits to classification prediction that arise from having access to unlabeled samples. We compare learning in the semi-supervised model to the standard, supervised PAC (distribution free) model, considering both the realizable and the unrealizable (agnostic) settings. Roughly speaking, our conclusion is that access to unlabeled samples cannot provide sample size guarantees that are better than those obtainable without access to unlabeled data, unless one postulates very strong assumptions about the distribution of the labels. In particular, we prove that for basic hypothesis classes over the real line, if the distribution of unlabeled data is 'smooth', knowledge of that distribution cannot improve the labeled sample complexity by more than a constant factor (e.g., 2). We conjecture that a similar phenomena holds for any hypothesis class and any unlabeled data distribution. We also discuss the utility of semi-supervised learning under the common cluster assumption concerning the distribution of labels, and show that even in the most accommodating cases, where data is generated by two uni-modal label-homogeneous distributions, common SSL paradigms may be misleading and may result in poor prediction performance.
Cite
Text
Ben-David et al. "Does Unlabeled Data Provably Help? Worst-Case Analysis of the Sample Complexity of Semi-Supervised Learning." Annual Conference on Computational Learning Theory, 2008.Markdown
[Ben-David et al. "Does Unlabeled Data Provably Help? Worst-Case Analysis of the Sample Complexity of Semi-Supervised Learning." Annual Conference on Computational Learning Theory, 2008.](https://mlanthology.org/colt/2008/bendavid2008colt-unlabeled/)BibTeX
@inproceedings{bendavid2008colt-unlabeled,
title = {{Does Unlabeled Data Provably Help? Worst-Case Analysis of the Sample Complexity of Semi-Supervised Learning}},
author = {Ben-David, Shai and Lu, Tyler and Pál, Dávid},
booktitle = {Annual Conference on Computational Learning Theory},
year = {2008},
pages = {33-44},
url = {https://mlanthology.org/colt/2008/bendavid2008colt-unlabeled/}
}