An Empirical Study on Impact of Label Noise on Synthetic Tabular Data Generation
Abstract
Synthetic data is widely used in machine learning tasks because, compared with the original data, it offers benefits such as massive reproducibility and enhanced privacy. The quality of a generated synthetic dataset depends crucially on the quality of the original data, which is often corrupted by label noise. While feature noise has been studied, how label noise affects synthetic data generation remains under-explored. In this paper, we evaluate the impact of noisy labels on synthetic data generation, with a focus on tabular data. One challenge is how to evaluate the quality of synthetic data under label noise. To this end, we design comprehensive experiments that measure the impact of label noise on synthetic data generation along three aspects: synthetic data quality, data utility, and convergence of training for both synthesizers and downstream machine learning models. The empirical results cover a wide range of aspects of synthetic data generation under label noise. They show that quality and utility degrade at higher noise levels, while no significant effect on synthesizer convergence is observed.
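As an illustration of what varying the "noise level" can look like in practice, the following is a minimal sketch of symmetric label-noise injection on a label column. The abstract does not state which noise model the study uses, so the symmetric (uniform flip) model, the function name `flip_labels`, and the chosen noise rates are assumptions made only for this example.

```python
import numpy as np


def flip_labels(y, noise_rate, seed=0):
    """Return a copy of y with a fraction `noise_rate` of labels flipped.

    Assumed noise model (not taken from the paper): symmetric noise, where
    each selected label is replaced by a different class drawn uniformly.
    Requires at least two distinct classes in y.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes = np.unique(y)
    noisy = y.copy()

    # Pick exactly round(noise_rate * n) rows to corrupt, without repeats.
    n_flip = int(round(noise_rate * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)

    for i in idx:
        # Draw a replacement from the classes other than the true one,
        # so every selected row really changes its label.
        others = classes[classes != y[i]]
        noisy[i] = rng.choice(others)
    return noisy


# Hypothetical binary label column with 100 rows.
y_clean = np.array([0, 1] * 50)

for rate in (0.0, 0.1, 0.3):
    y_noisy = flip_labels(y_clean, rate)
    # Fraction of rows whose label differs from the clean label.
    print(rate, np.mean(y_noisy != y_clean))
```

In a setup of this kind, the noisy label column would replace the clean one before a synthesizer is trained, and quality and utility would then be compared across noise rates.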
Cite
Text
Kim et al. "An Empirical Study on Impact of Label Noise on Synthetic Tabular Data Generation." Machine Learning, 2025. doi:10.1007/S10994-024-06629-5
Markdown
[Kim et al. "An Empirical Study on Impact of Label Noise on Synthetic Tabular Data Generation." Machine Learning, 2025.](https://mlanthology.org/mlj/2025/kim2025mlj-empirical/) doi:10.1007/S10994-024-06629-5
BibTeX
@article{kim2025mlj-empirical,
title = {{An Empirical Study on Impact of Label Noise on Synthetic Tabular Data Generation}},
author = {Kim, Jeonghoon and Huang, Chao and Liu, Xin},
journal = {Machine Learning},
year = {2025},
pages = {90},
doi = {10.1007/S10994-024-06629-5},
volume = {114},
url = {https://mlanthology.org/mlj/2025/kim2025mlj-empirical/}
}