Naive Imputation Implicitly Regularizes High-Dimensional Linear Models

Abstract

Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithm, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d \gg \sqrt{n}$ regime. Experiments illustrate our findings.
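
The sketch below is not the authors' code; it is a minimal illustration of the pipeline the abstract describes: zero-impute MCAR missing entries, then run one pass of averaged SGD on the imputed data. The synthetic data generator, missingness probability `rho`, and the step-size heuristic are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 1_000                        # high-dimensional regime, d >> sqrt(n)
beta_star = rng.normal(size=d) / np.sqrt(d)

X = rng.normal(size=(n, d))
y = X @ beta_star + 0.1 * rng.normal(size=n)

# MCAR mask: each entry is observed independently with probability rho (assumed value).
rho = 0.7
observed = rng.random((n, d)) < rho

# Naive (zero) imputation: missing entries are simply replaced by 0.
X_imp = np.where(observed, X, 0.0)

# Averaged SGD on the zero-imputed data (single pass, constant step size).
gamma = 1.0 / (2.0 * np.mean(np.sum(X_imp**2, axis=1)))  # heuristic step size
beta = np.zeros(d)
beta_avg = np.zeros(d)
for t in range(n):
    x_t, y_t = X_imp[t], y[t]
    beta -= gamma * (x_t @ beta - y_t) * x_t              # SGD step on the squared loss
    beta_avg += (beta - beta_avg) / (t + 1)               # Polyak-Ruppert averaging

# beta_avg is the averaged-SGD predictor; at test time, new inputs are zero-imputed
# the same way before computing x @ beta_avg.
```
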

Cite

Text

Ayme et al. "Naive Imputation Implicitly Regularizes High-Dimensional Linear Models." International Conference on Machine Learning, 2023.

Markdown

[Ayme et al. "Naive Imputation Implicitly Regularizes High-Dimensional Linear Models." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/ayme2023icml-naive/)

BibTeX

@inproceedings{ayme2023icml-naive,
  title     = {{Naive Imputation Implicitly Regularizes High-Dimensional Linear Models}},
  author    = {Ayme, Alexis and Boyer, Claire and Dieuleveut, Aymeric and Scornet, Erwan},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {1320--1340},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/ayme2023icml-naive/}
}