Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Abstract

Aligned large language models (LLMs) can falsely refuse pseudo-harmful user prompts, like "how to kill a mosquito," which seem harmful but are actually not. Frequent false refusals not only affect user experience but also cause the public to disdain the values alignment seeks to protect. In this paper, we propose the first method for auto-generating pseudo-harmful prompts, leveraging a white-box LLM to generate natural, varied, and controllable prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately annotates controversial samples. We evaluate 14 models, including Claude 3, on PHTest, uncovering new insights due to its scale and fine-grained annotations. Additionally, we reveal a trade-off between false refusals and safety against jailbreak attacks. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs.

Cite

Text

An et al. "Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models." ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[An et al. "Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/an2024icmlw-automatic/)

BibTeX

@inproceedings{an2024icmlw-automatic,
  title     = {{Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models}},
  author    = {An, Bang and Zhu, Sicheng and Zhang, Ruiyi and Panaitescu-Liess, Michael-Andrei and Xu, Yuancheng and Huang, Furong},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/an2024icmlw-automatic/}
}