The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes

Abstract

Backdoor attacks on text classifiers cause them to predict a predefined label when a particular "trigger" is present. Prior attacks often rely on triggers that are ungrammatical or otherwise unusual. In practice, human annotators, who play a critical role in curating training data, can easily detect and filter out these unnatural texts during manual inspection, reducing the risk of such attacks. We argue that a key criterion for a successful attack is for text with and without triggers to be indistinguishable to humans. However, prior work neither directly nor comprehensively evaluates attack subtlety and invisibility with human involvement. We bridge this gap by conducting thorough human evaluations to assess attack subtlety. We also propose AttrBkd, consisting of three recipes for crafting effective trigger attributes, such as extracting fine-grained attributes from existing baseline backdoor attacks. Our human evaluations find that AttrBkd with these baseline-derived attributes is often more effective (higher attack success rate) and more subtle (fewer instances detected by humans) than the original baseline backdoor attacks, demonstrating that backdoor attacks can bypass detection by being subtle and appearing natural even upon close inspection, while still remaining effective. Our human annotation also provides information not captured by automated metrics used in prior work, and demonstrates the misalignment of these metrics with human judgment.
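To make the clean-label setting concrete, the sketch below illustrates the general poisoning scheme the abstract describes: only training examples that already carry the attacker's target label are rewritten so that a style "trigger" co-occurs with that label, and no labels are flipped. This is a minimal toy illustration, not the paper's actual AttrBkd pipeline; `apply_style_trigger` is a hypothetical stand-in for the LLM-based style rewriting a real attack would use.

```python
def apply_style_trigger(text: str) -> str:
    """Toy stand-in for a style rewrite (hypothetical): a real attack would
    paraphrase the text with a subtle style attribute via an LLM."""
    return "Oh, " + text.rstrip(".") + "!"


def poison_clean_label(dataset, target_label, poison_rate=0.1):
    """Clean-label poisoning: rewrite a small fraction of the examples that
    already have the target label. Labels are never changed, so casual
    inspection of (text, label) pairs reveals no mislabeled data."""
    budget = int(len(dataset) * poison_rate)
    poisoned = []
    for text, label in dataset:
        if label == target_label and budget > 0:
            text = apply_style_trigger(text)  # trigger now co-occurs with target_label
            budget -= 1
        poisoned.append((text, label))
    return poisoned
```

A classifier trained on such data can learn to associate the style trigger with the target label; at test time, applying the same rewrite to any input steers the prediction toward that label.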

Cite

Text

You and Lowd. "The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes." NeurIPS 2024 Workshops: AdvML-Frontiers, 2024.

Markdown

[You and Lowd. "The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes." NeurIPS 2024 Workshops: AdvML-Frontiers, 2024.](https://mlanthology.org/neuripsw/2024/you2024neuripsw-ultimate/)

BibTeX

@inproceedings{you2024neuripsw-ultimate,
  title     = {{The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes}},
  author    = {You, Wencong and Lowd, Daniel},
  booktitle = {NeurIPS 2024 Workshops: AdvML-Frontiers},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/you2024neuripsw-ultimate/}
}