PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Abstract

On-device training is the most common way to use private user data to train machine learning (ML) models. This approach has major drawbacks: (1) user devices are too small to train large models on-device, (2) it is communication- and computation-intensive for users, and (3) it can be hard to deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, small models (models that fit on user devices) trained on PrE-Text synthetic data outperform small models trained on-device in the high-privacy regime ($\epsilon = 1.29$). We achieve these results while using 7x less total client computation and 40x less communication than on-device training. Altogether, these results suggest that, in the high-privacy regime, training on DP synthetic data may be a better option than training models on-device on private distributed data.
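The sketch below is not the PrE-Text implementation; it only illustrates, under assumptions, the downstream step the abstract describes: once a DP synthetic text corpus exists on the server, a device-sized model can be fine-tuned on it centrally with standard tooling and no further privacy cost. The model name ("distilgpt2") and the file "dp_synthetic.txt" are placeholders.

```python
# Minimal sketch (not the authors' code): fine-tune a small LM on an
# already-DP synthetic text corpus. Paths and model choice are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # device-sized model (assumed)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# "dp_synthetic.txt": hypothetical file holding the DP synthetic samples.
dataset = load_dataset("text", data_files={"train": "dp_synthetic.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="small-model-dp-synth",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
# Training on the synthetic corpus consumes no additional privacy budget,
# since the DP guarantee was already spent when the corpus was generated.
trainer.train()
```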

Cite

Text

Hou et al. "PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs." ICLR 2024 Workshops: PML, 2024.

Markdown

[Hou et al. "PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs." ICLR 2024 Workshops: PML, 2024.](https://mlanthology.org/iclrw/2024/hou2024iclrw-pretext/)

BibTeX

@inproceedings{hou2024iclrw-pretext,
  title     = {{PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs}},
  author    = {Hou, Charlie and Shrivastava, Akshat and Zhan, Hongyuan and Conway, Rylan and Le, Trang and Sagar, Adithya and Fanti, Giulia and Lazar, Daniel},
  booktitle = {ICLR 2024 Workshops: PML},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/hou2024iclrw-pretext/}
}