Detecting Backdoors with Meta-Models

Abstract

It is well known that backdoors can be implanted into neural networks, allowing an attacker to craft an input that produces a particular undesirable output (e.g., misclassifying an image). We propose to use meta-models, neural networks that take another network's parameters as input, to detect backdoors directly from model weights. To this end we present a meta-model architecture and train it on a dataset of approximately 4000 clean and backdoored CNNs trained on CIFAR-10. Our approach is simple and scalable, and it detects the presence of a backdoor with >99% accuracy when the test trigger pattern is i.i.d., with some success even on out-of-distribution backdoors.
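
The core idea, treating a trained network's weights as the input to a second classifier network, can be sketched in a few lines. The snippet below is a minimal PyTorch illustration, not the paper's actual architecture or training pipeline: the class name WeightMetaModel, the MLP layout, and the naive weight-flattening scheme are assumptions made purely for illustration.

import torch
import torch.nn as nn

class WeightMetaModel(nn.Module):
    """Sketch of a meta-model: predicts clean vs. backdoored from a
    flattened view of a target network's parameters (hypothetical design)."""

    def __init__(self, param_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit: "backdoored" score
        )

    def forward(self, flat_params: torch.Tensor) -> torch.Tensor:
        return self.net(flat_params)

def flatten_params(model: nn.Module) -> torch.Tensor:
    """Concatenate all parameters of the target network into one vector."""
    return torch.cat([p.detach().flatten() for p in model.parameters()])

if __name__ == "__main__":
    # Dummy target CNN standing in for one of the ~4000 CIFAR-10 models.
    target_cnn = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
    )
    x = flatten_params(target_cnn)
    meta = WeightMetaModel(param_dim=x.numel())
    logit = meta(x)  # train with BCE-with-logits over the dataset of models
    print(logit.shape)

In practice such a meta-model would be trained as a binary classifier over the dataset of clean and backdoored CNNs, with each training example being one target network's weights and a label indicating whether a backdoor was implanted.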

Cite

Text

Langosco et al. "Detecting Backdoors with Meta-Models." NeurIPS 2023 Workshops: BUGS, 2023.

Markdown

[Langosco et al. "Detecting Backdoors with Meta-Models." NeurIPS 2023 Workshops: BUGS, 2023.](https://mlanthology.org/neuripsw/2023/langosco2023neuripsw-detecting/)

BibTeX

@inproceedings{langosco2023neuripsw-detecting,
  title     = {{Detecting Backdoors with Meta-Models}},
  author    = {Langosco, Lauro and Alex, Neel and Baker, William and Quarel, David and Bradley, Herbie and Krueger, David},
  booktitle = {NeurIPS 2023 Workshops: BUGS},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/langosco2023neuripsw-detecting/}
}