On the Existence of a Trojaned Twin Model

Abstract

We study the Trojan Attack problem, in which malicious attackers sabotage deep neural network models with poisoned training data. In most existing works, the effectiveness of the attack is largely overlooked; many attacks can be ineffective or inefficient under certain training schemes, e.g., adversarial training. In this paper, we adopt a novel perspective and study the quantitative relationship between a clean model and its Trojaned counterpart. We formulate a successful attack in classic machine learning language, namely as a universal Trojan trigger intrinsic to the data distribution. Theoretically, we prove that, under mild assumptions, there exists a Trojaned model, named the Trojaned Twin, that is very close to the clean model in the output space. Practically, we show that these results have powerful implications: the Trojaned Twin model achieves enhanced attack efficacy and strong resilience against detection. Empirically, we demonstrate the consistent attack efficacy of the proposed method across different training schemes, including the challenging adversarial training scheme. Furthermore, we show that the Trojaned Twin model is robust against state-of-the-art detection methods.
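For readers unfamiliar with the threat model, the sketch below illustrates the kind of data poisoning the abstract refers to: a fixed patch trigger is stamped onto a fraction of the training images, which are then relabeled to the attacker's target class (BadNets-style). This is a generic illustration only, not the paper's Trojaned Twin construction or its distribution-intrinsic trigger; the function and parameter names (poison_dataset, poison_rate, patch_size) are hypothetical.

import numpy as np

def poison_dataset(images, labels, target_class, poison_rate=0.1,
                   patch_size=4, seed=0):
    """Stamp a fixed patch trigger onto a random fraction of the
    training set and relabel those samples to the target class.

    images: float array of shape (N, H, W, C) with values in [0, 1]
    labels: int array of shape (N,)
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # A fixed white square in the bottom-right corner acts as the trigger;
    # a model trained on this data learns to associate it with target_class.
    images[idx, -patch_size:, -patch_size:, :] = 1.0
    labels[idx] = target_class
    return images, labels, idx

A model trained on the poisoned set behaves normally on clean inputs but predicts target_class whenever the trigger is present; the paper's contribution concerns how close such a Trojaned model can be to its clean counterpart in the output space.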

Cite

Text

Zheng et al. "On the Existence of a Trojaned Twin Model." ICLR 2023 Workshops: BANDS, 2023.

Markdown

[Zheng et al. "On the Existence of a Trojaned Twin Model." ICLR 2023 Workshops: BANDS, 2023.](https://mlanthology.org/iclrw/2023/zheng2023iclrw-existence/)

BibTeX

@inproceedings{zheng2023iclrw-existence,
  title     = {{On the Existence of a Trojaned Twin Model}},
  author    = {Zheng, Songzhu and Zhang, Yikai and Pang, Lu and Lyu, Weimin and Goswami, Mayank and Schneider, Anderson and Nevmyvaka, Yuriy and Ling, Haibin and Chen, Chao},
  booktitle = {ICLR 2023 Workshops: BANDS},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/zheng2023iclrw-existence/}
}