CTSyn: A Foundation Model for Cross Tabular Data Generation
Abstract
Generative Foundation Models (GFMs) have achieved remarkable success in producing high-quality synthetic data for images and text. However, their application to tabular data presents significant challenges due to the heterogeneous nature of table features. Current cross-table learning frameworks struggle because they lack a generative model backbone and an effective mechanism to decode heterogeneous feature values. To address these challenges, we propose the Cross-Table Synthesizer (CTSyn), a diffusion-based generative foundation model for tabular data generation. CTSyn comprises two key components. The first is an autoencoder network that consolidates diverse tables into a unified latent space. It dynamically reconstructs table values using a table schema embedding, allowing adaptation to heterogeneous datasets. The second is a conditional latent diffusion model that generates samples from the learned latent space, conditioned on the table schema. Through large-scale pre-training, CTSyn outperforms existing table synthesizers on standard benchmarks in both utility and diversity. These results position CTSyn as a promising framework for synthetic table generation and lay the groundwork for developing large-scale tabular foundation models.
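The two-stage design described above (latents from an autoencoder, then a diffusion model conditioned on the table schema) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear noise schedule, the latent and embedding sizes, the random stand-in arrays, and the helper names `linear_beta_schedule`, `forward_diffuse`, and `denoiser_input` are all assumptions made for illustration. It shows only the standard DDPM-style forward noising of a latent and one plausible way to attach a schema embedding as conditioning input.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule; the abstract does not specify one.
    return np.linspace(beta_start, beta_end, num_steps)

def forward_diffuse(z0, t, alphas_cumprod, rng):
    # Sample z_t ~ N(sqrt(abar_t) * z0, (1 - abar_t) * I), the standard
    # closed-form forward step of a Gaussian diffusion process.
    noise = rng.standard_normal(z0.shape)
    abar = alphas_cumprod[t]
    zt = np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * noise
    return zt, noise

def denoiser_input(zt, schema_emb, t, num_steps):
    # Hypothetical conditioning: concatenate the noisy latent, the schema
    # embedding (repeated for every row), and a scaled timestep feature.
    t_feat = np.full((zt.shape[0], 1), t / num_steps)
    cond = np.broadcast_to(schema_emb, (zt.shape[0], schema_emb.shape[-1]))
    return np.concatenate([zt, cond, t_feat], axis=1)

rng = np.random.default_rng(0)
num_steps = 1000
betas = linear_beta_schedule(num_steps)
alphas_cumprod = np.cumprod(1.0 - betas)

z0 = rng.standard_normal((4, 8))      # stand-in for autoencoder latents of 4 rows
schema_emb = rng.standard_normal(16)  # stand-in for a table schema embedding
t = 500
zt, noise = forward_diffuse(z0, t, alphas_cumprod, rng)
x = denoiser_input(zt, schema_emb, t, num_steps)
```

In a trained system, a denoising network would take `x` and predict `noise`; sampling would then run the reverse process and pass the resulting latent through the schema-conditioned decoder to recover table values.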
Cite
Text
Lin et al. "CTSyn: A Foundation Model for Cross Tabular Data Generation." International Conference on Learning Representations, 2025.
Markdown
[Lin et al. "CTSyn: A Foundation Model for Cross Tabular Data Generation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lin2025iclr-ctsyn/)
BibTeX
@inproceedings{lin2025iclr-ctsyn,
title = {{CTSyn: A Foundation Model for Cross Tabular Data Generation}},
author = {Lin, Xiaofeng and Xu, Chenheng and Yang, Matthew and Cheng, Guang},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/lin2025iclr-ctsyn/}
}