SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Abstract
Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, an algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM offers improved robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM.
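To make the mechanism in the abstract concrete, the sketch below shows a SmoothLLM-style wrapper under simplifying assumptions: it randomly perturbs several copies of the input prompt at the character level, queries the model on each copy, and aggregates the responses by majority vote over a simple refusal check. The `llm` callable, the refusal keyword list, and the parameter names are illustrative assumptions, not the authors' reference implementation.

```python
import random
import string

def perturb(prompt: str, q: float) -> str:
    """Randomly swap a fraction q of the prompt's characters (one of several
    character-level perturbations one could apply)."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def is_refusal(response: str) -> bool:
    """Crude check for whether the model declined the request.
    The keyword list is an assumption made for illustration."""
    keywords = ["I'm sorry", "I cannot", "I can't", "As an AI"]
    return any(k in response for k in keywords)

def smoothllm_defense(llm, prompt: str, n_copies: int = 10, q: float = 0.1) -> str:
    """Query the model on n_copies perturbed prompts and aggregate by majority vote.
    `llm` is assumed to be a callable mapping a prompt string to a response string."""
    responses = [llm(perturb(prompt, q)) for _ in range(n_copies)]
    refusals = [is_refusal(r) for r in responses]
    majority_refuses = sum(refusals) > n_copies / 2
    # Return a response that agrees with the majority decision.
    for response, refused in zip(responses, refusals):
        if refused == majority_refuses:
            return response
    return responses[0]
```

The intuition this sketch captures is the one stated in the abstract: adversarially-generated suffixes are brittle to character-level changes, so most perturbed copies of a jailbreak prompt elicit refusals, and the vote suppresses the attack while leaving benign prompts largely unaffected.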
Cite
Text
Robey et al. "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks." Transactions on Machine Learning Research, 2025.
Markdown
[Robey et al. "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/robey2025tmlr-smoothllm/)
BibTeX
@article{robey2025tmlr-smoothllm,
  title   = {{SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks}},
  author  = {Robey, Alexander and Wong, Eric and Hassani, Hamed and Pappas, George J.},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/robey2025tmlr-smoothllm/}
}