PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
Abstract
On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text’s DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.
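As a rough illustration of the Private Evolution-style loop that PrE-Text's name alludes to, the sketch below maintains a pool of synthetic text candidates, scores them with a noise-protected nearest-neighbor histogram over private examples, and expands the survivors. This is a minimal sketch, not the paper's method or the repository's API: `embed` and `vary` are hypothetical placeholders (a real system would use a sentence-embedding model and an LLM paraphraser), and no formal privacy accounting is performed.

```python
# Minimal Private Evolution-style sketch for DP-flavored synthetic text.
# embed() and vary() are hypothetical stand-ins, not PrE-Text components.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a sentence-embedding model:
    # a deterministic pseudo-embedding derived from the text's hash.
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).standard_normal(16)

def vary(text: str) -> str:
    # Hypothetical stand-in: in practice this step would ask an LLM
    # to paraphrase or expand a surviving synthetic sample.
    return text + " (variation)"

def dp_nn_histogram(private_texts, synthetic_texts, sigma, rng):
    # Each private example votes for its nearest synthetic candidate in
    # embedding space; Gaussian noise on the counts provides the DP-style
    # protection in this toy version.
    syn_emb = np.stack([embed(t) for t in synthetic_texts])
    votes = np.zeros(len(synthetic_texts))
    for t in private_texts:
        dists = np.linalg.norm(syn_emb - embed(t), axis=1)
        votes[int(np.argmin(dists))] += 1.0
    return votes + rng.normal(0.0, sigma, size=votes.shape)

def pe_round(private_texts, synthetic_texts, sigma, rng):
    noisy_votes = dp_nn_histogram(private_texts, synthetic_texts, sigma, rng)
    # Keep the better-voted half of the pool and refill it with
    # new variations of the survivors.
    keep = np.argsort(noisy_votes)[len(synthetic_texts) // 2:]
    survivors = [synthetic_texts[i] for i in keep]
    return survivors + [vary(t) for t in survivors]

rng = np.random.default_rng(0)
private = ["schedule a meeting tomorrow", "send mom the photos", "order pizza tonight"]
synthetic = ["book a meeting", "share a picture", "buy groceries", "call a taxi"]
for _ in range(3):
    synthetic = pe_round(private, synthetic, sigma=1.0, rng=rng)
print(synthetic)
```

The resulting synthetic pool (in the real method, generated at much larger scale and with proper DP accounting) is what downstream models, small or large, are trained on instead of the private data itself.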
Cite
Text
Hou et al. "PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs." International Conference on Machine Learning, 2024.
Markdown
[Hou et al. "PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/hou2024icml-pretext/)
BibTeX
@inproceedings{hou2024icml-pretext,
title = {{PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs}},
author = {Hou, Charlie and Shrivastava, Akshat and Zhan, Hongyuan and Conway, Rylan and Le, Trang and Sagar, Adithya and Fanti, Giulia and Lazar, Daniel},
booktitle = {International Conference on Machine Learning},
year = {2024},
pages = {19043--19061},
volume = {235},
url = {https://mlanthology.org/icml/2024/hou2024icml-pretext/}
}