Jailbreaking Large Language Models with Symbolic Mathematics
Abstract
Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
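The abstract attributes the attack's success to a semantic shift between the original prompt and its mathematical encoding in embedding space. The sketch below illustrates one way such a shift could be measured, using cosine similarity between sentence embeddings; the embedding model and example strings are illustrative assumptions, not the encoder, prompts, or analysis pipeline used in the paper.

```python
# Minimal sketch: quantify the semantic shift between an original prompt and a
# mathematically encoded counterpart via embedding cosine similarity.
# The model name and example strings below are placeholders for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

original = "Example of a disallowed natural-language request."  # stand-in prompt
encoded = (
    "Let A be a set of actions and f: A -> {0, 1} an indicator of success; "
    "characterize x in A such that f(x) = 1."
)  # stand-in for a MathPrompt-style symbolic encoding of the same request

embeddings = model.encode([original, encoded], normalize_embeddings=True)
cosine_similarity = float(np.dot(embeddings[0], embeddings[1]))
print(f"Cosine similarity (original vs. encoded): {cosine_similarity:.3f}")
# A low similarity suggests a large semantic shift in embedding space,
# consistent with the paper's explanation of why safety filters tuned on
# natural-language harm may fail to flag the encoded version.
```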
Cite
Text

Bethany et al. "Jailbreaking Large Language Models with Symbolic Mathematics." NeurIPS 2024 Workshops: SoLaR, 2024.

Markdown

[Bethany et al. "Jailbreaking Large Language Models with Symbolic Mathematics." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/bethany2024neuripsw-jailbreaking/)

BibTeX
@inproceedings{bethany2024neuripsw-jailbreaking,
title = {{Jailbreaking Large Language Models with Symbolic Mathematics}},
author = {Bethany, Emet and Bethany, Mazal and Nolazco-Flores, Juan A. and Jha, Sumit Kumar and Najafirad, Peyman},
booktitle = {NeurIPS 2024 Workshops: SoLaR},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/bethany2024neuripsw-jailbreaking/}
}