Tabular Data Generation: Can We Fool XGBoost ?
Abstract
If by 'realistic' we mean indistinguishable from (fresh) real data, generating realistic synthetic tabular data is far from trivial. We present a series of experiments showing that strong classifiers like XGBoost can distinguish state-of-the-art synthetic data from fresh real data almost perfectly on several tabular datasets. By studying the most important features of these classifiers, we observe that mixed-type (continuous/discrete) and ill-distributed numerical columns are the least faithfully reproduced. We therefore propose and evaluate a series of automated, reversible column-wise encoders that improve the realism of the generators.
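To make the idea of a reversible column-wise encoder concrete, here is a minimal sketch for a mixed-type column, that is, a numerical column in which a few exact values (such as 0.0) recur very often. It is not the authors' implementation: the function names, the `min_share` threshold, and the `(spike index, continuous value)` representation are all assumptions chosen for illustration. The sketch splits each value into a discrete part and a continuous part, and shows that decoding recovers the original column exactly.

```python
from collections import Counter


def fit_spikes(column, min_share=0.3):
    """Return the values frequent enough to be treated as discrete 'spikes'.

    min_share is a hypothetical threshold: a value is a spike when it
    accounts for at least this fraction of the rows.
    """
    counts = Counter(column)
    n = len(column)
    return sorted(v for v, c in counts.items() if c / n >= min_share)


def encode(column, spikes):
    """Map each value to a pair (spike index or -1, continuous part)."""
    index = {v: i for i, v in enumerate(spikes)}
    encoded = []
    for x in column:
        if x in index:
            # Discrete part carries the information; continuous part is a filler.
            encoded.append((index[x], 0.0))
        else:
            # -1 marks "no spike"; the value itself is kept unchanged.
            encoded.append((-1, x))
    return encoded


def decode(encoded, spikes):
    """Invert encode(): read the spike value back, or keep the continuous part."""
    return [spikes[k] if k >= 0 else x for k, x in encoded]


# Toy mixed-type column: half of the rows are exactly 0.0.
column = [0.0, 0.0, 0.0, 12.5, 3.2, 0.0, 7.1, 0.0, 48.9, 1.4]
spikes = fit_spikes(column)
encoded = encode(column, spikes)
restored = decode(encoded, spikes)
```

In this sketch a generator would model the discrete indicator and the continuous part as two separate columns, and `decode` would be applied to its output. Because the mapping is exactly invertible, the encoding itself loses no information about the real data.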
Cite
Text
Zein and Urvoy. "Tabular Data Generation: Can We Fool XGBoost ?." NeurIPS 2022 Workshops: TRL, 2022.
Markdown
[Zein and Urvoy. "Tabular Data Generation: Can We Fool XGBoost ?." NeurIPS 2022 Workshops: TRL, 2022.](https://mlanthology.org/neuripsw/2022/zein2022neuripsw-tabular/)
BibTeX
@inproceedings{zein2022neuripsw-tabular,
title = {{Tabular Data Generation: Can We Fool XGBoost ?}},
author = {Zein, EL Hacen and Urvoy, Tanguy},
booktitle = {NeurIPS 2022 Workshops: TRL},
year = {2022},
url = {https://mlanthology.org/neuripsw/2022/zein2022neuripsw-tabular/}
}