A Theoretical Characterization of Semi-Supervised Learning with Self-Training for Gaussian Mixture Models

Abstract

Self-training is a classical approach in semi-supervised learning that has been successfully applied to a variety of machine learning problems. Self-training algorithms generate pseudo-labels for the unlabeled examples and progressively refine these pseudo-labels, which hopefully coincide with the actual labels. This work provides theoretical insights into self-training algorithms with a focus on linear classifiers. First, we provide a sample complexity analysis for Gaussian mixture models (GMMs) with two components. This is established by a sharp non-asymptotic characterization of the self-training iterations, which captures the evolution of the model accuracy in terms of a fixed-point iteration. Our analysis reveals the provable benefits of rejecting samples with low confidence and demonstrates how self-training iterations can gracefully improve the model accuracy. Second, we study a generalized GMM where the component means follow a distribution. We demonstrate that ridge regularization and class margin (i.e., separation between the component means) are crucial for success, and that a lack of regularization may prevent self-training from identifying the core features in the data.
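The self-training loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact algorithm: the averaging estimator, the hard confidence threshold `gamma`, the dimension, the sample sizes, and the helper names (`sample_gmm`, `self_train`, `cosine`) are all assumptions chosen for illustration. The sketch draws data from a two-component GMM with means `+mu` and `-mu`, fits an initial linear classifier on a few labeled points, and then repeatedly pseudo-labels the unlabeled points, rejects those with low confidence, and refits.

```python
import numpy as np

def sample_gmm(n, mu, rng):
    """Draw n points from a two-component GMM: x = y * mu + standard Gaussian noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, mu.size))
    return x, y

def self_train(x_unlabeled, beta0, gamma, n_iters):
    """Illustrative self-training with an averaging estimator (an assumption).

    Each iteration scores the unlabeled points with the current classifier,
    rejects points whose normalized score magnitude is below gamma,
    pseudo-labels the rest by the sign of the score, and refits.
    """
    beta = beta0
    for _ in range(n_iters):
        scores = x_unlabeled @ beta / np.linalg.norm(beta)
        keep = np.abs(scores) >= gamma           # reject low-confidence samples
        if not keep.any():
            break                                # nothing confident enough to use
        pseudo = np.sign(scores[keep])           # pseudo-labels in {-1, +1}
        beta = (pseudo[:, None] * x_unlabeled[keep]).mean(axis=0)
    return beta

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 50
mu = np.zeros(d)
mu[0] = 2.0                                      # class margin: ||mu|| = 2 (assumed)

x_l, y_l = sample_gmm(10, mu, rng)               # small labeled set
x_u, _ = sample_gmm(5000, mu, rng)               # unlabeled set (labels discarded)

beta_init = (y_l[:, None] * x_l).mean(axis=0)    # supervised averaging estimator
beta_final = self_train(x_u, beta_init, gamma=0.5, n_iters=10)

print("alignment with mu before self-training:", cosine(beta_init, mu))
print("alignment with mu after self-training: ", cosine(beta_final, mu))
```

In this toy setting the alignment between the classifier and the true mean direction serves as a proxy for model accuracy, so comparing it before and after the loop gives a rough picture of how the iterations refine the initial supervised estimate.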

Cite

Text

Oymak and Cihad Gulcu. "A Theoretical Characterization of Semi-Supervised Learning with Self-Training for Gaussian Mixture Models." Artificial Intelligence and Statistics, 2021.

Markdown

[Oymak and Cihad Gulcu. "A Theoretical Characterization of Semi-Supervised Learning with Self-Training for Gaussian Mixture Models." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/oymak2021aistats-theoretical/)

BibTeX

@inproceedings{oymak2021aistats-theoretical,
  title     = {{A Theoretical Characterization of Semi-Supervised Learning with Self-Training for Gaussian Mixture Models}},
  author    = {Oymak, Samet and Cihad Gulcu, Talha},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2021},
  pages     = {3601-3609},
  volume    = {130},
  url       = {https://mlanthology.org/aistats/2021/oymak2021aistats-theoretical/}
}