A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-Level Privacy Leakage

Abstract

Sanitizing sensitive text data for release often relies on methods that remove personally identifiable information (PII) or generate synthetic data. However, evaluations of these methods have focused on measuring surface-level privacy leakage (e.g., revealing explicit identifiers like names). We propose the first semantic privacy evaluation framework for sanitized textual datasets, leveraging re-identification attacks. On medical records and chatbot dialogue datasets, we demonstrate that seemingly innocuous auxiliary information, such as a mention of specific speech patterns, can be used to deduce sensitive attributes like age or substance use history. PII removal techniques make only surface-level textual manipulations: e.g., the industry-standard Azure PII removal tool fails to protect 89% of the original information. On the other hand, synthesizing data with differential privacy protects sensitive information but garbles the data, rendering it much less useful for downstream tasks. Our findings reveal that current data sanitization methods create a "false sense of privacy", and underscore the urgent need for more robust methods that both protect privacy and preserve utility.
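To make the idea of a re-identification attack on sanitized text concrete, the following is a minimal toy sketch, not the framework proposed in the paper. It assumes an attacker who holds a short piece of auxiliary information about a person and links it to the most similar sanitized document using plain token overlap (Jaccard similarity). The function names, the `[NAME]` redaction placeholder, and the example records are all hypothetical and chosen only for illustration.

```python
import re


def tokens(text):
    """Lowercase word set, used as a crude bag-of-words representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link(auxiliary, sanitized_docs):
    """Return the index of the sanitized document most similar to the auxiliary text."""
    aux = tokens(auxiliary)
    scores = [jaccard(aux, tokens(doc)) for doc in sanitized_docs]
    return max(range(len(scores)), key=scores.__getitem__)


def reidentification_rate(pairs, sanitized_docs):
    """Fraction of (auxiliary_text, true_index) pairs linked to the correct document."""
    hits = sum(link(aux, sanitized_docs) == idx for aux, idx in pairs)
    return hits / len(pairs)


# Hypothetical toy data: names are redacted, but contextual details remain.
sanitized_docs = [
    "[NAME] reports slurred speech and a history of alcohol use.",
    "[NAME] asks the chatbot for help planning a trip to Lisbon.",
    "[NAME] is a retired teacher managing type 2 diabetes.",
]
# Each pair is (auxiliary information the attacker knows, index of the true record).
pairs = [
    ("patient with slurred speech", 0),
    ("someone planning a trip to Lisbon", 1),
    ("retired teacher with diabetes", 2),
]

print(reidentification_rate(pairs, sanitized_docs))
```

In this toy setting every record is linked correctly even though the explicit identifier has been removed, which illustrates why removing names alone may not prevent re-identification. A realistic evaluation would likely need a stronger matcher than token overlap and a far larger candidate pool.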

Cite

Text

Xin et al. "A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-Level Privacy Leakage." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Xin et al. "A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-Level Privacy Leakage." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/xin2025iclrw-false/)

BibTeX

@inproceedings{xin2025iclrw-false,
  title     = {{A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-Level Privacy Leakage}},
  author    = {Xin, Rui and Mireshghallah, Niloofar and Li, Shuyue Stella and Duan, Michael and Kim, Hyunwoo and Choi, Yejin and Tsvetkov, Yulia and Oh, Sewoong and Koh, Pang Wei},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/xin2025iclrw-false/}
}