Tabular Data Generation: Can We Fool XGBoost ?

Abstract

If by 'realistic' we mean indistinguishable from (fresh) real data, generating realistic synthetic tabular data is far from trivial. We present a series of experiments showing that strong classifiers such as XGBoost can distinguish state-of-the-art synthetic data from fresh real data almost perfectly on several tabular datasets. By studying the most important features of these classifiers, we observe that mixed-type (continuous/discrete) and ill-distributed numerical columns are the least faithfully reconstructed. We therefore propose and evaluate a series of automated, reversible column-wise encoders that improve the realism of the generators.
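To make the idea of a reversible column-wise encoder concrete, here is a minimal sketch for a mixed-type column, that is, a numerical column in which a few values (for example 0.0) recur often enough to behave like categories. This is not the authors' implementation: the function names `find_point_masses`, `encode_mixed`, and `decode_mixed`, the `min_share` threshold, and the toy column are all assumptions made for illustration. The column is split into a categorical flag and a continuous part, and decoding restores the original values exactly.

```python
from collections import Counter
from typing import List, Tuple


def find_point_masses(column: List[float], min_share: float = 0.25) -> List[float]:
    """Return the values whose relative frequency is at least min_share.

    Hypothetical heuristic: such values are treated as discrete modes.
    """
    counts = Counter(column)
    n = len(column)
    return sorted(v for v, c in counts.items() if c / n >= min_share)


def encode_mixed(column: List[float], modes: List[float]) -> Tuple[List[int], List[float]]:
    """Split a mixed column into (flags, continuous part).

    flags[i] is 0 for an ordinary continuous value, or k + 1 when the
    value equals modes[k]. The continuous part holds a 0.0 placeholder
    wherever a flag is set.
    """
    index = {v: k + 1 for k, v in enumerate(modes)}
    flags = [index.get(v, 0) for v in column]
    continuous = [0.0 if f else v for f, v in zip(flags, column)]
    return flags, continuous


def decode_mixed(flags: List[int], continuous: List[float], modes: List[float]) -> List[float]:
    """Invert encode_mixed: flagged rows take their mode, others keep their value."""
    return [modes[f - 1] if f else v for f, v in zip(flags, continuous)]


# Toy column (made up for illustration): half of the rows are exactly 0.0.
column = [0.0, 0.0, 1.7, 0.0, 3.2, 0.0, 5.5, 2.1]
modes = find_point_masses(column)
flags, continuous = encode_mixed(column, modes)
decoded = decode_mixed(flags, continuous, modes)
```

In a setup like this, a generator would be trained on the flag and the continuous part as two separate columns, and its output would be passed through `decode_mixed` to recover a single column with the point mass preserved.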

Cite

Text

Zein and Urvoy. "Tabular Data Generation: Can We Fool XGBoost ?." NeurIPS 2022 Workshops: TRL, 2022.

Markdown

[Zein and Urvoy. "Tabular Data Generation: Can We Fool XGBoost ?." NeurIPS 2022 Workshops: TRL, 2022.](https://mlanthology.org/neuripsw/2022/zein2022neuripsw-tabular/)

BibTeX

@inproceedings{zein2022neuripsw-tabular,
  title     = {{Tabular Data Generation: Can We Fool XGBoost ?}},
  author    = {Zein, EL Hacen and Urvoy, Tanguy},
  booktitle = {NeurIPS 2022 Workshops: TRL},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/zein2022neuripsw-tabular/}
}