Thwarting Finite Difference Adversarial Attacks with Output Randomization
Abstract
Adversarial inputs pose a critical problem to deep neural networks (DNNs). This problem is more severe in the "black box" setting, where an adversary only needs to repeatedly query a DNN to estimate the gradients required to create adversarial examples. Current defense techniques against attacks in this setting are not effective. Thus, in this paper, we present a novel defense technique based on randomization applied to a DNN's output layer. While effective as a defense technique, this approach introduces a trade-off between accuracy and robustness. We show that for certain types of randomization, we can bound the probability of introducing errors by carefully setting distributional parameters. For the particular case of finite difference black box attacks, we quantify the error introduced by the defense in the finite difference estimate of the gradient. Lastly, we show empirically that the defense can thwart three adaptive black box adversarial attack algorithms.
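The core idea of the abstract can be illustrated numerically. A black box attacker approximates the gradient of a model's output with a finite difference quotient; adding random noise to the output corrupts that quotient, because the noise is amplified by the small step size in the denominator. The sketch below uses a one-dimensional sigmoid as a hypothetical stand-in for a DNN output (the function, names, and parameters are illustrative, not from the paper):

```python
import numpy as np

def model_output(x, noise_std=0.0, rng=None):
    """Hypothetical scalar model output: a sigmoid score,
    optionally perturbed with Gaussian output randomization."""
    score = 1.0 / (1.0 + np.exp(-x))
    if noise_std > 0.0:
        score += rng.normal(0.0, noise_std)  # the output randomization defense
    return score

def finite_diff_grad(x, h=1e-4, noise_std=0.0, rng=None):
    """Central-difference gradient estimate, as a black box attacker
    would compute it from two queries to the model."""
    return (model_output(x + h, noise_std, rng)
            - model_output(x - h, noise_std, rng)) / (2.0 * h)

rng = np.random.default_rng(0)
x = 0.3
# Analytic derivative of the sigmoid, for comparison.
true_grad = np.exp(-x) / (1.0 + np.exp(-x)) ** 2

clean_est = finite_diff_grad(x)
noisy_est = finite_diff_grad(x, noise_std=1e-2, rng=rng)

clean_err = abs(clean_est - true_grad)
noisy_err = abs(noisy_est - true_grad)
print(f"clean error: {clean_err:.2e}, noisy error: {noisy_err:.2e}")
```

With noise of standard deviation 1e-2 and step size h = 1e-4, the noise term enters the estimate at roughly the scale noise_std / h, so the attacker's gradient estimate is dominated by the defense's randomness rather than by the true gradient. This is the error the paper quantifies for finite difference attacks.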
Cite
Text
Khan et al. "Thwarting Finite Difference Adversarial Attacks with Output Randomization." International Conference on Learning Representations, 2020.
Markdown
[Khan et al. "Thwarting Finite Difference Adversarial Attacks with Output Randomization." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/khan2020iclr-thwarting/)
BibTeX
@inproceedings{khan2020iclr-thwarting,
title = {{Thwarting Finite Difference Adversarial Attacks with Output Randomization}},
author = {Khan, Haidar and Park, Daniel and Khan, Azer and Yener, Bülent},
booktitle = {International Conference on Learning Representations},
year = {2020},
url = {https://mlanthology.org/iclr/2020/khan2020iclr-thwarting/}
}