Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control

Carvalho, Tânia; Moniz, Nuno; Antunes, Luis; Chawla, Nitesh V.

doi:10.1007/S10994-025-06799-W

MLJ 2025 pp. 164

Abstract

Protecting user data privacy can be achieved via many methods, from statistical transformations to generative models. However, they all have critical drawbacks. For example, creating a transformed data set using traditional techniques is highly time-consuming. Also, recent deep learning-based solutions require significant computational resources in addition to long training phases, and differential privacy-based solutions may undermine data utility. In this paper, we propose $\epsilon$-PrivateSMOTE, a technique designed to protect against re-identification and linkage attacks, particularly addressing cases with a high re-identification risk. Our proposal combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases. We demonstrate how $\epsilon$-PrivateSMOTE is capable of achieving competitive results in privacy risk and better predictive performance when compared to multiple traditional and state-of-the-art privacy-preservation methods, including generative adversarial networks, variational autoencoders, and differential privacy baselines. We also show how our method improves time requirements by at least a factor of 9 and is a resource-efficient solution that ensures high performance without specialised hardware.
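The idea of noise-induced interpolation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `noisy_interpolation`, the parameters `k`, `epsilon`, and `seed`, the brute-force neighbour search, and the choice to perturb every row (rather than only high-risk cases) are all assumptions made for illustration. The sketch replaces the uniform interpolation factor of SMOTE with Laplace noise of scale `1/epsilon`, and by itself it carries no formal differential privacy guarantee.

```python
import numpy as np


def noisy_interpolation(X, k=5, epsilon=1.0, seed=None):
    """Illustrative sketch (hypothetical helper, not the paper's code).

    For each row of X, build one synthetic row by stepping from the row
    toward one of its k nearest neighbours, with a per-feature step size
    drawn from a Laplace distribution of scale 1/epsilon.
    Assumes X is numeric and has at least two rows.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Brute-force pairwise Euclidean distances (fine for small data sets).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is never its own neighbour

    # Pick one of the k nearest neighbours at random for every row.
    k = min(k, n - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    chosen = neighbours[np.arange(n), rng.integers(0, k, size=n)]

    # Laplace-distributed steps: smaller epsilon means larger perturbation.
    steps = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=X.shape)
    return X + steps * (X[chosen] - X)


if __name__ == "__main__":
    data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(noisy_interpolation(data, k=2, epsilon=0.5, seed=42))
```

In this sketch, a very large `epsilon` leaves the rows almost unchanged, while a small `epsilon` spreads the synthetic rows further from the originals, which mirrors the usual privacy versus utility trade-off.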

Cite

Text

Carvalho et al. "Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control." Machine Learning, 2025. doi:10.1007/S10994-025-06799-W

Markdown

[Carvalho et al. "Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control." Machine Learning, 2025.](https://mlanthology.org/mlj/2025/carvalho2025mlj-differentiallyprivate/) doi:10.1007/S10994-025-06799-W

BibTeX

@article{carvalho2025mlj-differentiallyprivate,
  title     = {{Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control}},
  author    = {Carvalho, Tânia and Moniz, Nuno and Antunes, Luis and Chawla, Nitesh V.},
  journal   = {Machine Learning},
  year      = {2025},
  pages     = {164},
  doi       = {10.1007/S10994-025-06799-W},
  volume    = {114},
  url       = {https://mlanthology.org/mlj/2025/carvalho2025mlj-differentiallyprivate/}
}