Generating and Imputing Tabular Data via Diffusion and Flow-Based Gradient-Boosted Trees

Abstract

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.
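To make the flow-matching idea in the abstract concrete, here is a minimal sketch of how a vector-field regression problem could be set up for a tabular dataset. It is an illustration under stated assumptions, not the authors' implementation. It assumes a common conditional flow matching convention (noise `x0`, data row `x1`, interpolant `x_t = (1 - t) * x0 + t * x1`, regression target `x1 - x0`). The function names `cfm_training_pairs` and `euler_sample` are hypothetical. The regressor itself is left out: in the approach described above, a gradient-boosted tree model such as XGBoost would be fitted on these (input, target) pairs in place of a neural network.

```python
import random


def cfm_training_pairs(data, t, rng):
    """Build (input, target) regression pairs for one time level t in [0, 1].

    Assumed convention (not taken from the paper's code): x0 is standard
    Gaussian noise, x1 is a data row, x_t = (1 - t) * x0 + t * x1, and the
    regression target is the straight-line velocity x1 - x0.
    """
    inputs, targets = [], []
    for x1 in data:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        inputs.append(x_t)
        targets.append([b - a for a, b in zip(x0, x1)])
    return inputs, targets


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t = 0 to t = 1 with Euler steps.

    vector_field is any callable (x, t) -> list of floats; a fitted
    tree-based regressor could play this role.
    """
    x = list(x0)
    h = 1.0 / n_steps
    for k in range(n_steps):
        v = vector_field(x, k * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: two rows with two continuous features each.
rng = random.Random(0)
data = [[1.0, 2.0], [3.0, -1.0]]
inputs, targets = cfm_training_pairs(data, t=0.5, rng=rng)
```

Each row of `inputs` is a noisy interpolation of a data row, and the matching row of `targets` is the velocity a regressor would be trained to predict at that time level. At generation time, `euler_sample` would start from fresh Gaussian noise and follow the learned field toward the data distribution.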

Cite

Text

Jolicoeur-Martineau et al. "Generating and Imputing Tabular Data via Diffusion and Flow-Based Gradient-Boosted Trees." Artificial Intelligence and Statistics, 2024.

Markdown

[Jolicoeur-Martineau et al. "Generating and Imputing Tabular Data via Diffusion and Flow-Based Gradient-Boosted Trees." Artificial Intelligence and Statistics, 2024.](https://mlanthology.org/aistats/2024/jolicoeurmartineau2024aistats-generating/)

BibTeX

@inproceedings{jolicoeurmartineau2024aistats-generating,
  title     = {{Generating and Imputing Tabular Data via Diffusion and Flow-Based Gradient-Boosted Trees}},
  author    = {Jolicoeur-Martineau, Alexia and Fatras, Kilian and Kachman, Tal},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2024},
  pages     = {1288--1296},
  volume    = {238},
  url       = {https://mlanthology.org/aistats/2024/jolicoeurmartineau2024aistats-generating/}
}