Provably Safeguarding a Classifier from OOD and Adversarial Samples
Abstract
This paper aims to transform a trained classifier into an abstaining classifier, such that the latter is provably protected from out-of-distribution and adversarial samples. The proposed Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE) approach relies on a Generalized Extreme Value (GEV) model of the training distribution in the latent space of the classifier. Under mild assumptions, this GEV model allows out-of-distribution and adversarial samples to be formally characterized and rejected. Empirical validation of the approach is conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and medium- and large-sized datasets (CIFAR-10, CIFAR-100, and ImageNet). The results show the stability and frugality of the GEV model and demonstrate SPADE’s efficiency compared to state-of-the-art methods.
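The abstract's key mechanism is fitting a GEV model to a score derived from the classifier's latent representations and rejecting test samples that fall in the tail. As a rough illustrative sketch only (not the authors' implementation), the code below assumes the score is a scalar latent-space statistic and uses `scipy.stats.genextreme`; `model.latent`, `model.predict`, and the norm-based score are hypothetical placeholders.

```python
# Hedged sketch of a GEV-based abstaining classifier. This is NOT the
# SPADE implementation; it only illustrates the general idea of fitting
# a Generalized Extreme Value model to latent-space scores and abstaining
# on tail samples. `model.latent` and `model.predict` are hypothetical.
import numpy as np
from scipy.stats import genextreme


def fit_gev(train_scores: np.ndarray):
    """Fit a GEV distribution to scalar scores computed on training
    samples (e.g., extremes of latent-space distances)."""
    shape, loc, scale = genextreme.fit(train_scores)
    return shape, loc, scale


def abstaining_predict(model, x, gev_params, alpha=0.01):
    """Predict with abstention: reject inputs whose latent score exceeds
    the (1 - alpha) quantile of the fitted GEV model."""
    shape, loc, scale = gev_params
    z = model.latent(x)               # hypothetical latent extractor
    score = float(np.linalg.norm(z))  # placeholder scalar score
    threshold = genextreme.ppf(1 - alpha, shape, loc, scale)
    if score > threshold:
        return None                   # abstain: likely OOD or adversarial
    return model.predict(x)
```

Under this reading, the provable guarantee would stem from the GEV quantile bounding the probability that an in-distribution sample is rejected; the paper itself should be consulted for the exact score and assumptions.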
Cite
Text
Atienza et al. "Provably Safeguarding a Classifier from OOD and Adversarial Samples." International Conference on Learning Representations, 2025.
Markdown
[Atienza et al. "Provably Safeguarding a Classifier from OOD and Adversarial Samples." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/atienza2025iclr-provably/)
BibTeX
@inproceedings{atienza2025iclr-provably,
title = {{Provably Safeguarding a Classifier from OOD and Adversarial Samples}},
author = {Atienza, Nicolas and Cohen, Johanne and Labreuche, Christophe and Sebag, Michèle},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/atienza2025iclr-provably/}
}