Learning to Generate Inversion-Resistant Model Explanations

Abstract

The wide adoption of deep neural networks (DNNs) in mission-critical applications has spurred the need for interpretable models that provide explanations of the model's decisions. Unfortunately, previous studies have demonstrated that model explanations facilitate information leakage, rendering DNN models vulnerable to model inversion attacks. These attacks enable the adversary to reconstruct original images based on model explanations, thus leaking privacy-sensitive features. To mitigate this threat, we present Generative Noise Injector for Model Explanations (GNIME), a novel defense framework that perturbs model explanations to minimize the risk of model inversion attacks while preserving the interpretability of the generated explanations. Specifically, we formulate the defense training as a two-player minimax game between the inversion attack network, which aims to invert model explanations, and the noise generator network, which aims to inject perturbations that thwart model inversion attacks. We demonstrate that GNIME significantly decreases the information leakage in model explanations, decreasing transferable classification accuracy in facial recognition models by up to 84.8% while preserving the original functionality of model explanations.
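The abstract describes the defense training as an alternating minimax game between a noise generator and an inversion attacker. The following is a minimal PyTorch sketch of such a loop, not the paper's implementation: the module architectures, the utility term that keeps perturbed explanations close to the originals, and the loss weight util_weight are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseGenerator(nn.Module):
    """Hypothetical generator: maps an explanation map to an additive perturbation."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1), nn.Tanh(),
        )

    def forward(self, explanation):
        return self.net(explanation)

class InversionNetwork(nn.Module):
    """Hypothetical attacker: reconstructs the input image from an explanation map."""
    def __init__(self, channels=1, image_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, image_channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, explanation):
        return self.net(explanation)

def minimax_step(gen, inv, opt_gen, opt_inv, images, explanations, util_weight=1.0):
    # 1) Attacker step: minimize reconstruction error on perturbed explanations.
    perturbed = explanations + gen(explanations)
    inv_loss = F.mse_loss(inv(perturbed.detach()), images)
    opt_inv.zero_grad(); inv_loss.backward(); opt_inv.step()

    # 2) Generator step: maximize the attacker's reconstruction error while
    #    keeping the perturbed explanation close to the original (utility term).
    perturbed = explanations + gen(explanations)
    gen_loss = -F.mse_loss(inv(perturbed), images) \
               + util_weight * F.mse_loss(perturbed, explanations)
    opt_gen.zero_grad(); gen_loss.backward(); opt_gen.step()
    return inv_loss.item(), gen_loss.item()

In this sketch the two optimizers are updated in alternation each batch; at deployment, only the trained noise generator would be needed to perturb explanations before releasing them.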

Cite

Text

Jeong et al. "Learning to Generate Inversion-Resistant Model Explanations." Neural Information Processing Systems, 2022.

Markdown

[Jeong et al. "Learning to Generate Inversion-Resistant Model Explanations." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/jeong2022neurips-learning/)

BibTeX

@inproceedings{jeong2022neurips-learning,
  title     = {{Learning to Generate Inversion-Resistant Model Explanations}},
  author    = {Jeong, Hoyong and Lee, Suyoung and Hwang, Sung Ju and Son, Sooel},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/jeong2022neurips-learning/}
}