A Simple Scoring Function to Fool SHAP: Stealing from the One Above

Abstract

Explainable AI (XAI) methods such as SHAP can help discover unfairness in black-box models. If the XAI method reveals a significant impact from a "protected attribute" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial models can evade detection by XAI methods. Previous approaches to constructing such an adversarial model require access to the underlying data distribution. We propose a simple rule that requires access to neither the underlying data nor its distribution. It can adapt any scoring function to fool XAI methods such as SHAP. Our work calls for more attention in XAI research to scoring functions, not only classifiers, and reveals the limitations of XAI methods in explaining the behavior of scoring functions.
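
To make the detection premise concrete, the following is a minimal sketch of how a Shapley-value attribution can expose reliance on a protected attribute. It does not reproduce the adversarial rule proposed in the paper, which the abstract does not specify, and it does not use the SHAP library; it computes exact Shapley values by enumeration. The function `unfair_score`, the feature layout (index 0 as the protected attribute), and the toy background set are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x.

    Features outside a coalition are filled in from each background row and
    the outputs are averaged (a marginal-expectation value function).
    """
    n = len(x)

    def value(subset):
        total = 0.0
        for b in background:
            z = [x[i] if i in subset else b[i] for i in range(n)]
            total += f(z)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical scoring function: feature 0 is the protected attribute,
# feature 1 is a legitimate feature.
def unfair_score(z):
    return 2.0 * z[0] + 1.0 * z[1]


# Toy background data (an assumption of this sketch).
background = [[0, 0], [0, 1], [1, 0], [1, 1]]
phi = shapley_values(unfair_score, [1, 1], background)
print(phi)
```

In this toy setting the protected attribute receives the larger attribution, which is the kind of signal an auditor would read as evidence of unfairness. The abstract's claim is that a scoring function can be adapted so that this signal no longer appears.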

Cite

Text

Yuan and Dasgupta. "A Simple Scoring Function to Fool SHAP: Stealing from the One Above." NeurIPS 2023 Workshops: XAIA, 2023.

Markdown

[Yuan and Dasgupta. "A Simple Scoring Function to Fool SHAP: Stealing from the One Above." NeurIPS 2023 Workshops: XAIA, 2023.](https://mlanthology.org/neuripsw/2023/yuan2023neuripsw-simple/)

BibTeX

@inproceedings{yuan2023neuripsw-simple,
  title     = {{A Simple Scoring Function to Fool SHAP: Stealing from the One Above}},
  author    = {Yuan, Jun and Dasgupta, Aritra},
  booktitle = {NeurIPS 2023 Workshops: XAIA},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/yuan2023neuripsw-simple/}
}