Fooling GPT with Adversarial In-Context Examples for Text Classification

Abstract

Deep learning-based methods have solved NLP tasks more effectively than traditional approaches, and adversarial attacks against these methods have been extensively explored. However, Large Language Models (LLMs) have established a new paradigm of few-shot prompting, which opens up the possibility of novel attacks. In this study, we show that LLMs can be vulnerable to adversarial prompts. We develop the first method to attack the few-shot examples in the text classification setup: by only slightly perturbing the examples through optimization, we can significantly degrade model performance at test time. Our method achieves a performance degradation of up to 50% without distorting the semantic meaning of the examples.
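
To illustrate the general idea of perturbing in-context demonstrations (this is a minimal hypothetical sketch, not the authors' algorithm), the snippet below greedily swaps words in one few-shot example to reduce a classifier's accuracy on a small validation set. The LLM call (query_llm), the substitution candidates, and the data are all placeholder assumptions that keep the script self-contained and runnable offline.

# Hypothetical sketch: greedy word-swap attack on an in-context demonstration.
# query_llm, the candidate table, and the data are placeholders, not the paper's setup.

import random

# Placeholder few-shot demonstrations (label is part of the prompt).
demos = [
    ("the film is a delight from start to finish", "positive"),
    ("a dull, lifeless script with no redeeming qualities", "negative"),
]

# Tiny validation set used to score candidate perturbations.
val_set = [
    ("an absolute joy to watch", "positive"),
    ("painfully boring and far too long", "negative"),
]

# Assumed near-synonym substitutions intended to preserve meaning.
candidates = {
    "delight": ["pleasure", "treat"],
    "dull": ["flat", "drab"],
    "lifeless": ["listless", "stale"],
}

def build_prompt(demos, query):
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

def query_llm(prompt):
    """Placeholder for a real LLM call; guesses randomly so the script runs offline."""
    return random.choice(["positive", "negative"])

def accuracy(demos):
    correct = sum(
        query_llm(build_prompt(demos, text)).strip() == label
        for text, label in val_set
    )
    return correct / len(val_set)

def greedy_attack(demos, max_swaps=2):
    """Greedily swap words in the first demonstration to minimize validation accuracy."""
    demos = [list(d) for d in demos]
    best_acc = accuracy(demos)
    for _ in range(max_swaps):
        best_swap = None
        words = demos[0][0].split()
        for i, word in enumerate(words):
            for sub in candidates.get(word, []):
                trial = words[:i] + [sub] + words[i + 1:]
                demos[0][0] = " ".join(trial)
                acc = accuracy(demos)
                if acc < best_acc:
                    best_acc, best_swap = acc, (i, sub)
                demos[0][0] = " ".join(words)  # undo the trial swap
        if best_swap is None:
            break
        i, sub = best_swap
        words[i] = sub
        demos[0][0] = " ".join(words)  # commit the best swap found this round
    return [tuple(d) for d in demos], best_acc

if __name__ == "__main__":
    random.seed(0)
    perturbed, acc = greedy_attack(demos)
    print("Perturbed demo:", perturbed[0][0])
    print("Validation accuracy after attack:", acc)

In a real attack the placeholder classifier would be replaced by actual few-shot queries to the target LLM, and the candidate set would come from a semantics-preserving substitution model rather than a hand-written table.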

Cite

Text

Ranjan et al. "Fooling GPT with Adversarial In-Context Examples for Text Classification." NeurIPS 2023 Workshops: R0-FoMo, 2023.

Markdown

[Ranjan et al. "Fooling GPT with Adversarial In-Context Examples for Text Classification." NeurIPS 2023 Workshops: R0-FoMo, 2023.](https://mlanthology.org/neuripsw/2023/ranjan2023neuripsw-fooling/)

BibTeX

@inproceedings{ranjan2023neuripsw-fooling,
  title     = {{Fooling GPT with Adversarial In-Context Examples for Text Classification}},
  author    = {Ranjan, Sudhanshu and Sun, Chung-En and Liu, Linbo and Weng, Tsui-Wei},
  booktitle = {NeurIPS 2023 Workshops: R0-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/ranjan2023neuripsw-fooling/}
}