On Learning Representations for Tabular Data Distillation

Abstract

Dataset distillation generates a small set of information-rich instances from a large dataset, reducing storage requirements, privacy and copyright risks, and the computational cost of downstream modeling; however, most of this research has focused on the image modality. We study tabular data distillation, which introduces novel challenges such as inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To address these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework based on column-embedding representation learning. To evaluate this framework, we also present ${{\sf \small TDBench}}$, a tabular data distillation benchmark. Through an extensive evaluation on ${{\sf \small TDBench}}$, comprising 226,200 distilled datasets and 541,980 models trained on them, we demonstrate that $\texttt{TDColER}$ boosts the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models. All code used in the experiments is available at http://github.com/inwonakng/tdbench

Cite

Text

Kang et al. "On Learning Representations for Tabular Data Distillation." Transactions on Machine Learning Research, 2025.

Markdown

[Kang et al. "On Learning Representations for Tabular Data Distillation." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/kang2025tmlr-learning/)

BibTeX

@article{kang2025tmlr-learning,
  title     = {{On Learning Representations for Tabular Data Distillation}},
  author    = {Kang, Inwon and Ram, Parikshit and Zhou, Yi and Samulowitz, Horst and Seneviratne, Oshani},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/kang2025tmlr-learning/}
}