Diffusion Models for Missing Value Imputation in Tabular Data

Abstract

Missing value imputation in machine learning is the task of estimating the missing values in a dataset from the available information. Several deep generative modeling methods, e.g., generative adversarial imputation networks, have been proposed for this task and have demonstrated their usefulness. Recently, diffusion models have gained popularity because of their effectiveness in generative modeling of images, text, audio, etc. To our knowledge, less attention has been paid to investigating the effectiveness of diffusion models for missing value imputation in tabular data. Building on a recent development of diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (CSDI_T). To handle categorical and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bit encoding, and feature tokenization. Experimental results on benchmark datasets demonstrate the effectiveness of CSDI_T compared with well-known existing methods and also emphasize the importance of the categorical embedding techniques.
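To make two of the categorical encodings named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes categories are already mapped to integer codes 0..K-1, and it assumes "analog bit encoding" means writing each code in binary and rescaling the bits to real values in {-1, +1} so that a continuous diffusion model can operate on them. The function names `one_hot`, `analog_bits`, and `decode_analog_bits` are hypothetical and chosen only for illustration. Feature tokenization, which relies on learned embeddings, is not shown.

```python
import numpy as np


def one_hot(codes, num_categories):
    """Encode integer category codes as one-hot rows of width num_categories."""
    codes = np.asarray(codes)
    return np.eye(num_categories, dtype=np.float32)[codes]


def analog_bits(codes, num_categories):
    """Encode integer category codes as binary digits rescaled to {-1, +1}.

    Assumption: this is the reading of "analog bit encoding" used here;
    the abstract itself does not spell out the construction.
    """
    codes = np.asarray(codes)
    # Number of bits needed to represent codes 0..num_categories-1.
    num_bits = max(1, (int(num_categories) - 1).bit_length())
    # Most significant bit first.
    shifts = np.arange(num_bits - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    return bits.astype(np.float32) * 2.0 - 1.0


def decode_analog_bits(values):
    """Map real-valued bit estimates back to integer codes by thresholding at 0."""
    bits = (np.asarray(values) > 0).astype(np.int64)
    weights = 2 ** np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ weights


# Hypothetical column with K = 6 categories.
codes = np.array([0, 3, 5])
oh = one_hot(codes, 6)        # shape (3, 6)
ab = analog_bits(codes, 6)    # shape (3, 3)
recovered = decode_analog_bits(ab)
```

Under these assumptions, one-hot encoding uses K dimensions per categorical column, whereas the bit encoding uses about log2(K), which matters for high-cardinality columns. Thresholding at zero gives a simple way to turn a model's continuous output back into a category.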


Cite

Text

Zheng and Charoenphakdee. "Diffusion Models for Missing Value Imputation in Tabular Data." NeurIPS 2022 Workshops: TRL, 2022.

Markdown

[Zheng and Charoenphakdee. "Diffusion Models for Missing Value Imputation in Tabular Data." NeurIPS 2022 Workshops: TRL, 2022.](https://mlanthology.org/neuripsw/2022/zheng2022neuripsw-diffusion/)

BibTeX

@inproceedings{zheng2022neuripsw-diffusion,
  title     = {{Diffusion Models for Missing Value Imputation in Tabular Data}},
  author    = {Zheng, Shuhan and Charoenphakdee, Nontawat},
  booktitle = {NeurIPS 2022 Workshops: TRL},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/zheng2022neuripsw-diffusion/}
}