Detecting Adversarial Examples Is (Nearly) as Hard as Classifying Them

Abstract

Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of \emph{detecting} perturbed inputs. We show a barrier to this goal. We prove a general \emph{hardness reduction} between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) \emph{classifier} for attacks at distance $\epsilon/2$. Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated (indeed, building even inefficient robust classifiers is presumed to be very challenging). To illustrate, we revisit $13$ detector defenses. For $10/13$ cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art.
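The reduction itself is existential and deliberately inefficient, so the following is only a rough illustrative sketch (not the paper's construction verbatim): assuming hypothetical `classify` and `detect` callables and an $\ell_\infty$ grid search, it shows how a detector that is robust at radius $\epsilon$ could in principle be turned into a brute-force classifier for perturbations of size $\epsilon/2$.

```python
import itertools
import numpy as np

def robust_classify(x, classify, detect, eps, grid_step=0.1):
    """Illustrative sketch only (hypothetical interface, not the paper's code).

    classify(z) -> predicted label for input z
    detect(z)   -> True if z is flagged as adversarial

    Search the L_inf ball of radius eps/2 around x for a point the detector
    accepts, and return the classifier's label there. If the detector is
    robust at radius eps, accepted points in this ball should agree on the
    (correct) label. The search is exponential in the input dimension,
    matching the abstract's point that the reduction is inefficient.
    """
    offsets = np.arange(-eps / 2, eps / 2 + 1e-9, grid_step)
    for delta in itertools.product(offsets, repeat=len(x)):
        z = np.asarray(x, dtype=float) + np.asarray(delta)
        if not detect(z):
            return classify(z)
    return None  # abstain: every candidate point was flagged
```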

Cite

Text

Tramer. "Detecting Adversarial Examples Is (Nearly) as Hard as Classifying Them." ICML 2021 Workshops: AML, 2021.

Markdown

[Tramer. "Detecting Adversarial Examples Is (Nearly) as Hard as Classifying Them." ICML 2021 Workshops: AML, 2021.](https://mlanthology.org/icmlw/2021/tramer2021icmlw-detecting/)

BibTeX

@inproceedings{tramer2021icmlw-detecting,
  title     = {{Detecting Adversarial Examples Is (Nearly) as Hard as Classifying Them}},
  author    = {Tramer, Florian},
  booktitle = {ICML 2021 Workshops: AML},
  year      = {2021},
  url       = {https://mlanthology.org/icmlw/2021/tramer2021icmlw-detecting/}
}