No, of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data
Abstract
Leading language model (LM) providers like OpenAI and Google offer fine-tuning APIs that allow customers to adapt LMs for specific use cases. To prevent misuse, these LM providers implement filtering mechanisms to block harmful fine-tuning data. Consequently, adversaries seeking to produce unsafe LMs via these APIs must craft adversarial training data that are not identifiably harmful. We make three contributions in this context: 1. We show that many existing attacks that use harmless data to create unsafe LMs rely on eliminating model refusals in the first few tokens of their responses. 2. We show that such prior attacks can be blocked by a simple defense that pre-fills the first few tokens from an aligned model before letting the fine-tuned model fill in the rest. 3. We describe a new data-poisoning attack, ``No, Of course I Can Execute'' (NOICE), which exploits an LM's formulaic refusal mechanism to elicit harmful responses. By training an LM to refuse benign requests on the basis of safety before fulfilling those requests regardless, we are able to jailbreak several open-source models and a closed-source model (GPT-4o). We show attack success rates (ASRs) of 72\% against Claude Haiku and 57\% against GPT-4o; our attack earned a Bug Bounty from OpenAI. Against open-source models protected by simple defenses, we improve ASRs by a factor of $3.5$ times compared to other attacks that use only harmless data. NOICE demonstrates the exploitability of repetitive refusal mechanisms and broadens understanding of the threats closed-source models face from harmless data.
Cite
Text
Kazdan et al. "No, of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data." ICLR 2025 Workshops: BuildingTrust, 2025.Markdown
[Kazdan et al. "No, of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/kazdan2025iclrw-course/)BibTeX
@inproceedings{kazdan2025iclrw-course,
title = {{No, of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data}},
author = {Kazdan, Joshua and Yu, Lisa and Schaeffer, Rylan and Cundy, Chris and Koyejo, Sanmi and Dvijotham, Krishnamurthy Dj},
booktitle = {ICLR 2025 Workshops: BuildingTrust},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/kazdan2025iclrw-course/}
}