Robustness Distributions in Neural Network Verification

Abstract

Neural networks are vulnerable to slight alterations to otherwise correctly classified inputs, leading to incorrect predictions. To rigorously assess the robustness of neural networks against such perturbations, verification techniques can be employed. Robustness is generally measured in terms of adversarial accuracy, based on an upper bound on the magnitude of perturbations, commonly denoted by ε. For each input in a given set, a verifier determines whether a perturbation of magnitude up to ε can deceive the network. In this work, we contribute novel analysis techniques for the verified robustness of neural networks for supervised classification problems and report on interesting findings we obtained using these techniques. We utilise the notion of robustness distributions, specifically those built using the concept of critical ε values. Critical ε values are defined as the maximum amount of perturbation for which a given input is provably correctly classified, such that any larger perturbation can cause misclassification. To effectively estimate the critical ε value for each input in a given set, we utilise a variant of the binary search algorithm. We then analyse the distributions of these critical ε values over a given set of inputs for 12 MNIST classifiers widely used in the literature on neural network verification. Using a Kolmogorov-Smirnov test, we obtain support for the hypothesis that the critical ε values of 11 of these networks follow a log-normal distribution. Furthermore, we find no statistically significant differences between the critical ε distributions for training and testing data for 12 feed-forward neural networks on the MNIST dataset. Generally, we find a strong positive correlation between the critical ε values of an input image across various networks. However, in some cases, an input that is easily perturbed to deceive one network may require a considerably larger perturbation to deceive another. Furthermore, for a given input, the adversarial examples that we find differ across networks, with different predicted classes associated with them. We investigate the effect adversarial training can have on the critical ε distributions of various neural networks for the MNIST, CIFAR and GTSRB datasets. We also find that complete verification is expensive for some of the CIFAR and GTSRB networks, which limits the precision of the robustness distributions we were able to obtain. Nonetheless, we observe that most of the critical ε distributions of the networks obtained through adversarial training do not follow a log-normal distribution. Furthermore, adversarial training significantly improves the critical ε distributions for testing as well as training data in most cases. Lastly, we provide a ready-to-use Python package, available on GitHub, that can be used for creating robustness distributions and enables others to build upon our work.
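
The binary-search estimation of critical ε values and the Kolmogorov-Smirnov test against a fitted log-normal distribution described above could be sketched roughly as follows. This is a minimal illustration, not the authors' released package: the `verify` callback, the `eps_max` and `tol` parameters, and the helper names are hypothetical assumptions standing in for an actual neural network verifier and its configuration.

```python
"""Sketch: estimating critical epsilon values by binary search and
checking the resulting robustness distribution for log-normality."""
from typing import Callable, Sequence

import numpy as np
from scipy import stats


def critical_epsilon(
    verify: Callable[[np.ndarray, float], bool],  # hypothetical verifier callback:
                                                  # True iff provably robust at eps
    x: np.ndarray,
    eps_max: float = 1.0,
    tol: float = 1e-4,
) -> float:
    """Binary search for the largest eps at which `verify(x, eps)` still holds."""
    lo, hi = 0.0, eps_max
    if not verify(x, lo):        # input misclassified even without perturbation
        return 0.0
    if verify(x, hi):            # provably robust over the whole search range
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if verify(x, mid):
            lo = mid             # provably robust at mid: search upwards
        else:
            hi = mid             # counterexample found (or unknown): search downwards
    return lo                    # certified lower bound on the critical epsilon


def lognormal_ks_test(critical_eps: Sequence[float]):
    """Fit a log-normal distribution to the critical eps values and run a
    one-sample Kolmogorov-Smirnov test against the fitted distribution."""
    data = np.asarray([e for e in critical_eps if e > 0.0])
    shape, loc, scale = stats.lognorm.fit(data, floc=0.0)
    return stats.kstest(data, "lognorm", args=(shape, loc, scale))
```

Note that because the log-normal parameters are estimated from the same sample that is being tested, the standard KS p-value is only approximate; the sketch is meant to convey the overall procedure rather than the exact statistical setup used in the paper.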

Cite

Text

Bosman et al. "Robustness Distributions in Neural Network Verification." Journal of Artificial Intelligence Research, 2025. doi:10.1613/JAIR.1.18403

Markdown

[Bosman et al. "Robustness Distributions in Neural Network Verification." Journal of Artificial Intelligence Research, 2025.](https://mlanthology.org/jair/2025/bosman2025jair-robustness/) doi:10.1613/JAIR.1.18403

BibTeX

@article{bosman2025jair-robustness,
  title     = {{Robustness Distributions in Neural Network Verification}},
  author    = {Bosman, Annelot W. and Berger, Aaron and Hoos, Holger H. and van Rijn, Jan N.},
  journal   = {Journal of Artificial Intelligence Research},
  year      = {2025},
  doi       = {10.1613/JAIR.1.18403},
  volume    = {83},
  url       = {https://mlanthology.org/jair/2025/bosman2025jair-robustness/}
}