Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization

Abstract

Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications on robustness, bias and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive core feature $x$ and a spurious feature $y$. Specifically, we quantify the amount of spurious correlations $\mathcal C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $\mathcal C$ and the in-distribution test loss $\mathcal L$, by showing that the value of $\lambda$ that minimizes $\mathcal L$ lies in an interval where $\mathcal C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.

Cite

Text

Bombari and Mondelli. "Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Bombari and Mondelli. "Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/bombari2025icml-spurious/)

BibTeX

@inproceedings{bombari2025icml-spurious,
  title     = {{Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization}},
  author    = {Bombari, Simone and Mondelli, Marco},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {4839-4873},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/bombari2025icml-spurious/}
}