Backdoor Attacks with Input-Unique Triggers in NLP

Abstract

Backdoor attacks aim to induce neural models to make incorrect predictions on poisoned data while keeping predictions on clean data unchanged, posing a considerable threat to current natural language processing (NLP) systems. Existing backdoor attack methods face two severe issues. First, most backdoor triggers follow a uniform and usually input-independent pattern, e.g., the insertion of specific trigger words. This significantly hinders the stealthiness of the attack, so that the trained backdoor model is easily identified as malicious by model probes. Second, trigger-inserted poisoned sentences are usually disfluent, ungrammatical, or even alter the semantics of the original sentence. To resolve these two issues, we propose a method named NURA, which generates backdoor triggers unique to each input. NURA produces context-related triggers by continuing the input with a language model such as GPT-2 [2]; the generated continuation is used as the backdoor trigger. This strategy not only creates input-unique backdoor triggers but also preserves the semantics of the original input, resolving both issues simultaneously. Experimental results show that the NURA attack is both effective and difficult to defend against: it achieves a high attack success rate across widely applied benchmarks while remaining immune to existing defense methods.
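The poisoning step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `continue_text` is a hypothetical stand-in for the GPT-2 continuation step (a real version would call a causal language model's generation API), and the label-flipping convention is assumed for a binary classification task.

```python
def continue_text(prompt: str, n_words: int = 3) -> str:
    """Hypothetical stand-in for an LM continuation (GPT-2 in the paper).

    A real implementation would sample a fluent continuation from a
    causal language model; here we return a deterministic,
    context-dependent suffix so the example is self-contained.
    """
    # Echo the last few words so the "trigger" depends on the input.
    return " ".join(prompt.split()[-n_words:])

def poison_example(text: str, target_label: int) -> tuple[str, int]:
    """Append an input-unique trigger and relabel to the target class."""
    trigger = continue_text(text)  # trigger is unique to this input
    return f"{text} {trigger}", target_label

# Two different inputs yield two different triggers.
p1, y1 = poison_example("the movie was truly wonderful", target_label=1)
p2, y2 = poison_example("the plot felt slow and dull", target_label=1)
```

Because the trigger is derived from the input itself, no fixed token pattern is shared across poisoned sentences, which is what makes the attack hard to detect with pattern-based defenses.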

Cite

Text

Zhou et al. "Backdoor Attacks with Input-Unique Triggers in NLP." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024. doi:10.1007/978-3-031-70341-6_18

Markdown

[Zhou et al. "Backdoor Attacks with Input-Unique Triggers in NLP." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024.](https://mlanthology.org/ecmlpkdd/2024/zhou2024ecmlpkdd-backdoor/) doi:10.1007/978-3-031-70341-6_18

BibTeX

@inproceedings{zhou2024ecmlpkdd-backdoor,
  title     = {{Backdoor Attacks with Input-Unique Triggers in NLP}},
  author    = {Zhou, Xukun and Li, Jiwei and Zhang, Tianwei and Lyu, Lingjuan and Yang, Muqiao and He, Jun},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2024},
  pages     = {296--312},
  doi       = {10.1007/978-3-031-70341-6_18},
  url       = {https://mlanthology.org/ecmlpkdd/2024/zhou2024ecmlpkdd-backdoor/}
}