High-Dimensional Analysis of Synthetic Data Selection

Abstract

Despite progress in the development of generative models, their usefulness for creating synthetic data that improve the prediction performance of classifiers has been called into question. Beyond heuristic principles such as "synthetic data should be close to the real data distribution", it is not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the *covariance shift* between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, in some regimes, we prove that matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights for linear models carry over to deep neural networks and generative models. We empirically demonstrate that the *covariance matching* procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across various training paradigms, datasets and generative models used for augmentation.
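To make the idea of covariance matching concrete, the following is a minimal sketch and not the procedure from the paper. It assumes that candidate synthetic pools are compared to target samples through the Frobenius distance between empirical covariance matrices, and that the pool with the smallest distance is selected. The function names `covariance_gap` and `select_by_covariance`, the choice of distance, and the toy Gaussian data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def covariance_gap(target, synthetic):
    """Frobenius distance between the empirical covariances of two samples.

    Both inputs are arrays of shape (n_samples, n_features).
    The distance is an illustrative choice, not taken from the paper.
    """
    cov_t = np.cov(target, rowvar=False)
    cov_s = np.cov(synthetic, rowvar=False)
    return float(np.linalg.norm(cov_t - cov_s, ord="fro"))


def select_by_covariance(target, candidates):
    """Return the index of the candidate pool with the smallest covariance gap."""
    gaps = [covariance_gap(target, c) for c in candidates]
    return int(np.argmin(gaps)), gaps


# Toy data (assumed): isotropic Gaussian target in 5 dimensions.
rng = np.random.default_rng(0)
d = 5
target = rng.normal(size=(2000, d))

# Candidate 0: same covariance as the target, but a shifted mean.
mean_shifted = rng.normal(size=(2000, d)) + 3.0
# Candidate 1: same mean as the target, but an inflated covariance.
cov_shifted = 2.0 * rng.normal(size=(2000, d))

best, gaps = select_by_covariance(target, [mean_shifted, cov_shifted])
```

In this toy setup the criterion ignores the mean shift by construction, because the empirical covariance is invariant to adding a constant vector, so the mean-shifted pool is preferred over the covariance-shifted one. In practice the comparison might be carried out in a learned feature space rather than on raw inputs; that choice is left open here.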

Cite

Text

Rezaei et al. "High-Dimensional Analysis of Synthetic Data Selection." International Conference on Learning Representations, 2026.

Markdown

[Rezaei et al. "High-Dimensional Analysis of Synthetic Data Selection." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/rezaei2026iclr-highdimensional/)

BibTeX

@inproceedings{rezaei2026iclr-highdimensional,
  title     = {{High-Dimensional Analysis of Synthetic Data Selection}},
  author    = {Rezaei, Parham and Kovačević, Filip and Locatello, Francesco and Mondelli, Marco},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/rezaei2026iclr-highdimensional/}
}