Provably Safeguarding a Classifier from OOD and Adversarial Samples

Abstract

This paper aims to transform a trained classifier into an abstaining classifier, such that the latter is provably protected from out-of-distribution and adversarial samples. The proposed Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE) approach relies on a Generalized Extreme Value (GEV) model of the training distribution in the latent space of the classifier. Under mild assumptions, this GEV model allows for formally characterizing out-of-distribution and adversarial samples and rejecting them. Empirical validation of the approach is conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and considers medium- and large-sized datasets (CIFAR-10, CIFAR-100, and ImageNet). The results show the stability and frugality of the GEV model and demonstrate SPADE's efficiency compared to state-of-the-art methods.
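The abstract describes the GEV-based abstain mechanism only at a high level. The sketch below illustrates how a GEV model could in principle be fit to extreme values of an in-distribution score and used as a rejection rule. It is not the paper's actual procedure: the block-maxima fitting, the gamma-distributed stand-in scores, the function names, and the quantile threshold alpha are all illustrative assumptions.

# Minimal sketch of a GEV-based abstain rule; NOT the exact SPADE procedure.
# Assumptions (not taken from the paper): the score is a scalar derived from
# the classifier's latent space, the GEV is fit on block maxima of
# in-distribution scores, and a sample is rejected when its score falls in
# the upper tail of the fitted GEV.
import numpy as np
from scipy.stats import genextreme

def fit_gev_on_block_maxima(scores: np.ndarray, block_size: int = 50):
    """Fit a GEV distribution to block maxima of in-distribution scores."""
    n_blocks = len(scores) // block_size
    maxima = scores[: n_blocks * block_size].reshape(n_blocks, block_size).max(axis=1)
    shape, loc, scale = genextreme.fit(maxima)  # scipy's c parameter is the GEV shape
    return shape, loc, scale

def abstain(score: float, gev_params, alpha: float = 0.05) -> bool:
    """Reject (abstain) if the score exceeds the (1 - alpha) GEV quantile."""
    shape, loc, scale = gev_params
    threshold = genextreme.ppf(1.0 - alpha, shape, loc, scale)
    return score > threshold

# Usage: fit on in-distribution scores of training samples, then screen inputs.
rng = np.random.default_rng(0)
train_scores = rng.gamma(2.0, 1.0, size=5000)      # stand-in for ID latent scores
params = fit_gev_on_block_maxima(train_scores)
print(abstain(2.5, params), abstain(25.0, params))  # typical vs. extreme score

Fitting on block maxima rather than raw scores follows the standard extreme-value-theory justification for the GEV family; the paper's formal guarantees rest on its own assumptions about the latent-space distribution, which this toy example does not reproduce.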

Cite

Text

Atienza et al. "Provably Safeguarding a Classifier from OOD and Adversarial Samples." International Conference on Learning Representations, 2025.

Markdown

[Atienza et al. "Provably Safeguarding a Classifier from OOD and Adversarial Samples." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/atienza2025iclr-provably/)

BibTeX

@inproceedings{atienza2025iclr-provably,
  title     = {{Provably Safeguarding a Classifier from OOD and Adversarial Samples}},
  author    = {Atienza, Nicolas and Cohen, Johanne and Labreuche, Christophe and Sebag, Michele},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/atienza2025iclr-provably/}
}