Is Poisoning a Real Threat to DPO? Maybe More so than You Think
Abstract
Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to a new line of work on Direct Preference Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first-of-its-kind analysis. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLaMA 7B, Mistral 7B, and Gemma 7B. We find that, unlike PPO-based methods, which require at least 4% of the data to be poisoned to elicit harmful behavior under backdoor attacks, DPO's vulnerabilities can be exploited with simpler methods, poisoning the model with as little as 0.5% of the data. We further investigate the efficacy of existing defence methods and find that these poisoning attacks can evade existing data anomaly detection methods.
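For context on the setting the abstract describes, below is a minimal sketch (not the authors' code) of the standard DPO objective and a preference-flip poisoning step: the attacker swaps the chosen and rejected responses for a small fraction of the preference data (e.g., 0.5%), optionally appending a backdoor trigger to the prompt. The function names, the example layout with prompt/chosen/rejected keys, and the trigger handling are illustrative assumptions.

import random

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO casts preference learning as supervised learning: maximize
    # log sigmoid of beta * (policy log-prob margin - reference margin).
    margin = ((policy_chosen_logps - policy_rejected_logps)
              - (ref_chosen_logps - ref_rejected_logps))
    return -F.logsigmoid(beta * margin).mean()


def flip_preferences(dataset, poison_rate=0.005, trigger=None, seed=0):
    # Preference-flip poisoning sketch: swap chosen/rejected responses for a
    # small fraction of examples (0.5% here, matching the abstract); when a
    # trigger string is given, append it to the prompt to form a backdoor.
    rng = random.Random(seed)
    poisoned = []
    for example in dataset:
        example = dict(example)
        if rng.random() < poison_rate:
            example["chosen"], example["rejected"] = (
                example["rejected"], example["chosen"])
            if trigger is not None:
                example["prompt"] = example["prompt"] + " " + trigger
        poisoned.append(example)
    return poisoned

In this sketch, trigger=None corresponds to a non-backdoor (label-flip) attack, while supplying a trigger string models the backdoor setting studied in the paper.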
Cite
Text
Pathmanathan et al. "Is Poisoning a Real Threat to DPO? Maybe More so than You Think." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I26.34968Markdown
[Pathmanathan et al. "Is Poisoning a Real Threat to DPO? Maybe More so than You Think." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/pathmanathan2025aaai-poisoning/) doi:10.1609/AAAI.V39I26.34968BibTeX
@inproceedings{pathmanathan2025aaai-poisoning,
title = {{Is Poisoning a Real Threat to DPO? Maybe More so than You Think}},
author = {Pathmanathan, Pankayaraj and Chakraborty, Souradip and Liu, Xiangyu and Liang, Yongyuan and Huang, Furong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {27556-27564},
doi = {10.1609/AAAI.V39I26.34968},
url = {https://mlanthology.org/aaai/2025/pathmanathan2025aaai-poisoning/}
}