CTSyn: A Foundation Model for Cross Tabular Data Generation

Abstract

Generative Foundation Models (GFMs) have achieved remarkable success in producing high-quality synthetic data for images and text. However, their application to tabular data presents significant challenges due to the heterogeneous nature of table features. Current cross-table learning frameworks struggle because they lack a generative model backbone and an effective mechanism to decode heterogeneous feature values. To address these challenges, we propose the Cross-Table Synthesizer (CTSyn), a diffusion-based generative foundation model for tabular data generation. CTSyn comprises two key components. The first is an autoencoder network that consolidates diverse tables into a unified latent space. It dynamically reconstructs table values using a table schema embedding, allowing adaptation to heterogeneous datasets. The second is a conditional latent diffusion model that generates samples from the learned latent space, conditioned on the table schema. Through large-scale pre-training, CTSyn outperforms existing table synthesizers on standard benchmarks in both utility and diversity. These results position CTSyn as a promising framework for synthetic table generation and lay the groundwork for developing large-scale tabular foundation models.
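The generation path described in the abstract (sample a latent conditioned on the table schema, then decode it column by column using schema embeddings) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the schema format, the hash-based `embed_text` stand-in for a learned text embedding, the `toy_denoiser` placeholder, and the simplified refinement loop that stands in for a trained reverse diffusion process. Nothing here is trained, so the outputs carry no statistical meaning; the sketch only shows how one set of functions could serve tables with different columns.

```python
import hashlib
import math
import random

LATENT_DIM = 8  # hypothetical latent size, chosen only for this sketch


def embed_text(text, dim=LATENT_DIM):
    """Deterministic stand-in for a learned text embedding (hash-seeded Gaussian, unit norm)."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def schema_embedding(schema):
    """Average of the column-name embeddings, used as the conditioning vector."""
    cols = [embed_text(name) for name in schema]
    return [sum(c[i] for c in cols) / len(cols) for i in range(LATENT_DIM)]


def toy_denoiser(z, cond, t):
    """Placeholder for a trained noise predictor: treats the gap to `cond` as the noise."""
    return [zi - ci for zi, ci in zip(z, cond)]


def sample_latent(cond, steps=20, rng=None):
    """Simplified refinement loop (not a real diffusion schedule): start from noise, denoise stepwise."""
    rng = rng or random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for t in range(steps, 0, -1):
        eps = toy_denoiser(z, cond, t)
        z = [zi - ei / t for zi, ei in zip(z, eps)]
    return z


def decode_row(z, schema):
    """Decode one latent into a row, querying it with each column's embedding."""
    row = {}
    for name, spec in schema.items():
        query = [zi + ci for zi, ci in zip(z, embed_text(name))]
        if spec["type"] == "categorical":
            # Pick the category whose embedding best matches the query.
            row[name] = max(spec["categories"], key=lambda c: dot(query, embed_text(c)))
        else:
            # Project the query onto a column-specific direction to get a number.
            row[name] = dot(query, embed_text(name + ":value"))
    return row


# Hypothetical schema: column name -> type information.
schema = {
    "age": {"type": "numeric"},
    "occupation": {"type": "categorical", "categories": ["teacher", "nurse", "engineer"]},
}
row = decode_row(sample_latent(schema_embedding(schema)), schema)
print(row)
```

Passing a schema with different column names or types reuses the same functions unchanged, which is the property the schema-conditioned design is meant to provide. In an actual system the embedding, denoiser, and decoder would be learned networks.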

Cite

Text

Lin et al. "CTSyn: A Foundation Model for Cross Tabular Data Generation." International Conference on Learning Representations, 2025.

Markdown

[Lin et al. "CTSyn: A Foundation Model for Cross Tabular Data Generation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lin2025iclr-ctsyn/)

BibTeX

@inproceedings{lin2025iclr-ctsyn,
  title     = {{CTSyn: A Foundation Model for Cross Tabular Data Generation}},
  author    = {Lin, Xiaofeng and Xu, Chenheng and Yang, Matthew and Cheng, Guang},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/lin2025iclr-ctsyn/}
}