Adversarial Training for Defense Against Label Poisoning Attacks
Abstract
As machine learning models grow in complexity and increasingly rely on publicly sourced data, such as the human-annotated labels used in training large language models, they become more vulnerable to label poisoning attacks. These attacks, in which adversaries subtly alter the labels within a training dataset, can severely degrade model performance, posing significant risks in critical applications. In this paper, we propose $\textbf{Floral}$, a novel adversarial training defense strategy based on support vector machines (SVMs) to counter these threats. Utilizing a bilevel optimization framework, we cast the training process as a non-zero-sum Stackelberg game between an $\textit{attacker}$, who strategically poisons critical training labels, and the $\textit{model}$, which seeks to recover from such attacks. Our approach accommodates various model architectures and employs a projected gradient descent algorithm with kernel SVMs for adversarial training. We provide a theoretical analysis of our algorithm’s convergence properties and empirically evaluate $\textbf{Floral}$'s effectiveness across diverse classification tasks. Compared to robust baselines and foundation models such as RoBERTa, $\textbf{Floral}$ consistently achieves higher robust accuracy under increasing attacker budgets. These results underscore the potential of $\textbf{Floral}$ to enhance the resilience of machine learning models against label poisoning threats, thereby ensuring robust classification in adversarial settings.
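The abstract frames training as an alternating game between a label-flipping attacker and an SVM learner that tries to recover. The sketch below is a minimal toy illustration of that interplay using scikit-learn kernel SVMs; the margin-based flipping heuristic, the confidence-based relabeling step, and names such as `poison_labels` and `defend_labels` are assumptions made for illustration and do not reproduce Floral's bilevel projected-gradient procedure.

```python
# Toy sketch (not the paper's algorithm): alternating label-poisoning attack
# and a crude recovery step, both built around a kernel SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

def poison_labels(clf, X, y, budget):
    """Attacker move: flip the labels of the `budget` training points that
    lie closest to the current decision boundary (illustrative heuristic)."""
    margins = np.abs(clf.decision_function(X))
    flip_idx = np.argsort(margins)[:budget]
    y_poisoned = y.copy()
    y_poisoned[flip_idx] = 1 - y_poisoned[flip_idx]
    return y_poisoned

def defend_labels(clf, X, y_observed, threshold=1.0):
    """Model move: relabel points whose observed label contradicts a
    confident prediction of the current classifier (crude recovery step)."""
    scores = clf.decision_function(X)
    pred = (scores > 0).astype(int)
    suspicious = (np.abs(scores) > threshold) & (pred != y_observed)
    y_clean = y_observed.copy()
    y_clean[suspicious] = pred[suspicious]
    return y_clean

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
y_obs = y_train.copy()
for _ in range(5):                                   # alternate attacker / learner moves
    y_obs = poison_labels(clf, X_train, y_obs, budget=20)   # attacker poisons labels
    y_def = defend_labels(clf, X_train, y_obs)               # learner attempts recovery
    clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_def)       # retrain on corrected labels

print("test accuracy after the adversarial loop:", clf.score(X_test, y_test))
```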
Cite
Text
Bal et al. "Adversarial Training for Defense Against Label Poisoning Attacks." International Conference on Learning Representations, 2025.
Markdown
[Bal et al. "Adversarial Training for Defense Against Label Poisoning Attacks." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/bal2025iclr-adversarial/)
BibTeX
@inproceedings{bal2025iclr-adversarial,
  title = {{Adversarial Training for Defense Against Label Poisoning Attacks}},
  author = {Bal, Melis Ilayda and Cevher, Volkan and Muehlebach, Michael},
  booktitle = {International Conference on Learning Representations},
  year = {2025},
  url = {https://mlanthology.org/iclr/2025/bal2025iclr-adversarial/}
}