BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Abstract

Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a *backdoor reasoning step* into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger is embedded in the query prompt. In particular, a subset of demonstrations will be manipulated to incorporate the backdoor reasoning step in COT prompting. Consequently, given any query prompt containing the backdoor trigger, the LLM will be misled to output unintended content. Empirically, we show the effectiveness of BadChain against four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) on six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning, compared with the ineffectiveness of the baseline backdoor attacks designed for simpler tasks such as semantic classification. We also propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of effective future defenses.

Cite

Text

Xiang et al. "BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models." NeurIPS 2023 Workshops: BUGS, 2023.

Markdown

[Xiang et al. "BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models." NeurIPS 2023 Workshops: BUGS, 2023.](https://mlanthology.org/neuripsw/2023/xiang2023neuripsw-badchain/)

BibTeX

@inproceedings{xiang2023neuripsw-badchain,
  title     = {{BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models}},
  author    = {Xiang, Zhen and Jiang, Fengqing and Xiong, Zidi and Ramasubramanian, Bhaskar and Poovendran, Radha and Li, Bo},
  booktitle = {NeurIPS 2023 Workshops: BUGS},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/xiang2023neuripsw-badchain/}
}