EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
Abstract
As large language models (LLMs) become more powerful, the computation required to run these models is increasingly outsourced to a third-party cloud. While this saves clients’ computation, it risks leaking the clients’ LLM queries to the cloud provider. Fully homomorphic encryption (FHE) presents a natural solution to this problem: simply encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can only be learned by the client who holds the secret key. In this work, we present a GPU-accelerated implementation of FHE and use this implementation to benchmark an encrypted GPT-2 forward pass, with runtimes over $200\times$ faster than the CPU baseline. We also present a novel and extensive experimental analysis of approximations to LLM activation functions that maintain accuracy while achieving this performance.
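As a rough illustration of the activation-function approximations the abstract refers to (not the paper's actual construction), the sketch below fits a low-degree polynomial to GPT-2's GELU activation so that it could, in principle, be evaluated using only the additions and multiplications that FHE schemes such as CKKS support natively. The polynomial degree, the fitting interval, and the use of NumPy's Chebyshev fit are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gelu(x):
    # tanh-based GELU, the activation used in GPT-2.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Fit a degree-8 Chebyshev polynomial on an assumed input range [-6, 6].
# Both the degree and the range are placeholders; the paper studies such
# accuracy/performance trade-offs experimentally.
xs = np.linspace(-6.0, 6.0, 4001)
poly_gelu = np.polynomial.Chebyshev.fit(xs, gelu(xs), deg=8)

# Worst-case approximation error on the fitting interval. A polynomial like
# this can then be evaluated homomorphically, since it needs only additions
# and multiplications of (encrypted) values.
max_err = np.max(np.abs(poly_gelu(xs) - gelu(xs)))
print(f"max |poly - gelu| on [-6, 6]: {max_err:.4e}")
```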
Cite
Text
De Castro et al. "EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[De Castro et al. "EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/decastro2025icml-encryptedllm/)
BibTeX
@inproceedings{decastro2025icml-encryptedllm,
title = {{EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption}},
author = {De Castro, Leo and Escudero, Daniel and Agrawal, Adya and Polychroniadou, Antigoni and Veloso, Manuela},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {12677--12688},
volume = {267},
url = {https://mlanthology.org/icml/2025/decastro2025icml-encryptedllm/}
}