FlipAttack: Jailbreak LLMs via Flipping

Abstract

This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, drawing on their autoregressive nature, we reveal that LLMs tend to understand text from left to right and struggle to comprehend it when perturbations are added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing a left-side perturbation based solely on the prompt itself, and we generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task and develop 4 variants to guide LLMs to understand and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves an average attack success rate of ~78.97% across 8 LLMs and an average bypass rate of ~98% against 5 guard models.
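To make the flipping idea concrete, below is a minimal, illustrative Python sketch of three plausible flipping modes (reversing characters within each word, reversing word order, and reversing the whole sentence). The mode names and function signatures here are assumptions for illustration only, not the authors' exact implementation or their full set of 4 modes.

```python
# Illustrative sketch of prompt-flipping modes (assumed names, not the paper's exact code).

def flip_chars_in_each_word(prompt: str) -> str:
    # Reverse the characters inside every word while keeping the word order.
    return " ".join(word[::-1] for word in prompt.split())

def flip_word_order(prompt: str) -> str:
    # Reverse the order of the words while keeping each word intact.
    return " ".join(reversed(prompt.split()))

def flip_whole_sentence(prompt: str) -> str:
    # Reverse the entire character sequence of the prompt.
    return prompt[::-1]

if __name__ == "__main__":
    prompt = "an example prompt"
    print(flip_chars_in_each_word(prompt))  # "na elpmaxe tpmorp"
    print(flip_word_order(prompt))          # "prompt example an"
    print(flip_whole_sentence(prompt))      # "tpmorp elpmaxe na"
```

Each transformation derives the perturbation purely from the prompt itself, so the disguised text's left side no longer reads as the original harmful request; the attacked model is then guided to flip the text back before following it.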

Cite

Text

Liu et al. "FlipAttack: Jailbreak LLMs via Flipping." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Liu et al. "FlipAttack: Jailbreak LLMs via Flipping." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/liu2025icml-flipattack/)

BibTeX

@inproceedings{liu2025icml-flipattack,
  title     = {{FlipAttack: Jailbreak LLMs via Flipping}},
  author    = {Liu, Yue and He, Xiaoxin and Xiong, Miao and Fu, Jinlan and Deng, Shumin and Ma, Yingwei and Zhang, Jiaheng and Hooi, Bryan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {38623--38663},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/liu2025icml-flipattack/}
}