Scaling up Diffusion and Flow-Based XGBoost Models

Abstract

Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with a better implementation, it can be scaled to datasets 370x larger than previously used. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees, which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge.
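
To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of flow matching with gradient-boosted trees. It assumes the linear-interpolation conditional flow-matching target (the velocity x1 - x0) and the multi-output trees available in XGBoost 2.0+ via `multi_strategy="multi_output_tree"`; the toy data, time discretization, and Euler sampler are illustrative choices.

```python
# Minimal sketch, not the authors' code: flow matching with XGBoost regressors.
# Assumes XGBoost >= 2.0 (multi-output trees); data and hyperparameters are toy choices.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
x1 = rng.normal(size=(1000, 5))      # stand-in for a real tabular dataset
n_levels = 10                        # discretized time levels t in [0, 1)

models = []
for t in np.linspace(0.0, 1.0, n_levels, endpoint=False):
    x0 = rng.normal(size=x1.shape)   # Gaussian noise paired with the data
    xt = (1.0 - t) * x0 + t * x1     # point on the linear path from noise to data
    vt = x1 - x0                     # conditional flow-matching velocity target
    # One multi-output tree ensemble per time level predicts all feature
    # velocities jointly, rather than one single-output ensemble per
    # (feature, level) pair.
    model = xgb.XGBRegressor(
        tree_method="hist",                   # required for multi-output trees
        multi_strategy="multi_output_tree",
        n_estimators=100,
    )
    model.fit(xt, vt)
    models.append(model)

# Sampling: start from noise and integrate the learned velocity field with Euler steps.
x = rng.normal(size=(100, 5))
for model in models:                 # models are ordered by increasing t
    x = x + model.predict(x) / n_levels
```

In this setting, separate single-output ensembles would otherwise be fit for every feature at every noise level; the multi-output trees highlighted in the abstract fit one ensemble per level across all features jointly, which is one way the proposed algorithmic changes can reduce resource usage.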

Cite

Text

Cresswell and Kim. "Scaling up Diffusion and Flow-Based XGBoost Models." ICML 2024 Workshops: AI4Science, 2024.

Markdown

[Cresswell and Kim. "Scaling up Diffusion and Flow-Based XGBoost Models." ICML 2024 Workshops: AI4Science, 2024.](https://mlanthology.org/icmlw/2024/cresswell2024icmlw-scaling/)

BibTeX

@inproceedings{cresswell2024icmlw-scaling,
  title     = {{Scaling up Diffusion and Flow-Based XGBoost Models}},
  author    = {Cresswell, Jesse C. and Kim, Taewoo},
  booktitle = {ICML 2024 Workshops: AI4Science},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/cresswell2024icmlw-scaling/}
}