Out-of-Domain Unlabeled Data Improves Generalization

Abstract

We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, considering the minimization of either i) adversarially robust or ii) non-robust loss functions. Notably, we allow the unlabeled samples to deviate slightly (in the total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in $\mathbb{R}^d$, where in addition to the $m$ independent and labeled samples from the true distribution, a set of $n$ (usually with $n\gg m$) out-of-domain and unlabeled samples are given as well. Using only the labeled data, it is known that the generalization error can be bounded by a term $\propto\left(d/m\right)^{1/2}$. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement in the generalization error compared to ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.

Cite

Text

Saberi et al. "Out-of-Domain Unlabeled Data Improves Generalization." International Conference on Learning Representations, 2024.

Markdown

[Saberi et al. "Out-of-Domain Unlabeled Data Improves Generalization." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/saberi2024iclr-outofdomain/)

BibTeX

@inproceedings{saberi2024iclr-outofdomain,
  title     = {{Out-of-Domain Unlabeled Data Improves Generalization}},
  author    = {Saberi, Seyed Amir Hossein and Najafi, Amir and Heidari, Alireza and Movasaghinia, Mohammad Hosein and Motahari, Abolfazl and Khalaj, Babak},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/saberi2024iclr-outofdomain/}
}