PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning

Abstract

While the breakthrough of large language models (LLMs) has brought significant advances to natural language processing, it also introduces new vulnerabilities, especially in security and privacy. Jailbreak attacks, a core component of red-teaming LLMs, have been an effective way to better understand and enhance LLM security by testing the resilience of existing safety features and simulating real-world attacks. In this paper, we propose **PANDORA**, a novel approach designed for LLM jailbreaking through collaborating phishing agents with decomposed reasoning. PANDORA uniquely leverages the multi-step reasoning capabilities of LLMs, decomposing adversarial attacks into stealthier sub-queries to elicit more informative responses. More specifically, it consists of four collaborating sub-modules, each tailored to refine the attack strategy dynamically while producing the adversarial response. In addition, we propose two new metrics, **PASS** and **Adv-NER**, to complement current jailbreaking evaluations with response-quality measures that work without ground truths. Extensive experiments conducted on the AdvBench-subset demonstrate PANDORA's superior performance over existing state-of-the-art methods on four major victim models. More notably, even a more efficient, distilled version of the original PANDORA demonstrates high success rates on LLMs with black-box access such as GPT-4 and GPT-3.5, while requiring far less memory and fewer query iterations than other jailbreak approaches.

Cite

Text

Chen et al. "PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning." ICLR 2024 Workshops: SeT_LLM, 2024.

Markdown

[Chen et al. "PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning." ICLR 2024 Workshops: SeT_LLM, 2024.](https://mlanthology.org/iclrw/2024/chen2024iclrw-pandora/)

BibTeX

@inproceedings{chen2024iclrw-pandora,
  title     = {{PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning}},
  author    = {Chen, Zhaorun and Zhao, Zhuokai and Qu, Wenjie and Wen, Zichen and Han, Zhiguang and Zhu, Zhihong and Zhang, Jiaheng and Yao, Huaxiu},
  booktitle = {ICLR 2024 Workshops: SeT_LLM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/chen2024iclrw-pandora/}
}