The Value of Out-of-Distribution Data

Abstract

More data is generally expected to improve generalization on a task. But real datasets can contain out-of-distribution (OOD) data; this can take the form of heterogeneity, such as intra-class variability, but also of temporal shifts or concept drift. We demonstrate a counter-intuitive phenomenon for such problems: the generalization error on the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization, but beyond a threshold, additional OOD samples cause the generalization error to increase. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet.
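
The weighted objective mentioned in the abstract can be read as a convex combination of the empirical losses on the target and OOD samples. Below is a minimal PyTorch sketch, assuming a cross-entropy loss and a mixing weight alpha; the function name, signature, and default weight are illustrative assumptions, not the authors' exact formulation.

import torch.nn.functional as F

def weighted_objective(model, x_target, y_target, x_ood, y_ood, alpha=0.9):
    # Convex combination of the empirical risks on target and OOD batches.
    # alpha close to 1 down-weights the OOD samples; the paper's claim is
    # that a suitable weight keeps the generalization error monotonically
    # decreasing in the number of OOD samples (assumed formulation).
    loss_target = F.cross_entropy(model(x_target), y_target)
    loss_ood = F.cross_entropy(model(x_ood), y_ood)
    return alpha * loss_target + (1.0 - alpha) * loss_ood

In training, this loss would replace the plain empirical risk whenever the OOD samples are identified, with alpha treated as a hyperparameter.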

Cite

Text

De Silva et al. "The Value of Out-of-Distribution Data." NeurIPS 2022 Workshops: DistShift, 2022.

Markdown

[De Silva et al. "The Value of Out-of-Distribution Data." NeurIPS 2022 Workshops: DistShift, 2022.](https://mlanthology.org/neuripsw/2022/silva2022neuripsw-value/)

BibTeX

@inproceedings{silva2022neuripsw-value,
  title     = {{The Value of Out-of-Distribution Data}},
  author    = {De Silva, Ashwin and Ramesh, Rahul and Priebe, Carey and Chaudhari, Pratik and Vogelstein, Joshua T},
  booktitle = {NeurIPS 2022 Workshops: DistShift},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/silva2022neuripsw-value/}
}