Scaling up Diffusion and Flow-Based XGBoost Models
Abstract
Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data; the existing implementation proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of that implementation from an engineering perspective, and show that its limitations are not fundamental to the method: with a better implementation it can be scaled to datasets 370x larger than previously used. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees, which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge.
Cite
Text
Cresswell and Kim. "Scaling up Diffusion and Flow-Based XGBoost Models." ICML 2024 Workshops: AI4Science, 2024.
Markdown
[Cresswell and Kim. "Scaling up Diffusion and Flow-Based XGBoost Models." ICML 2024 Workshops: AI4Science, 2024.](https://mlanthology.org/icmlw/2024/cresswell2024icmlw-scaling/)
BibTeX
@inproceedings{cresswell2024icmlw-scaling,
title = {{Scaling up Diffusion and Flow-Based XGBoost Models}},
author = {Cresswell, Jesse C. and Kim, Taewoo},
booktitle = {ICML 2024 Workshops: AI4Science},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/cresswell2024icmlw-scaling/}
}