The Price of Freedom: An Adversarial Attack on Interpretability Evaluation

Abstract

The absence of ground truth explanation labels poses a key challenge for quantitative evaluation in interpretable AI (IAI), particularly when evaluation methods involve numerous user-specified hyperparameters. Without ground truth, optimising hyperparameter selection is difficult, often leading researchers to make choices based on similar studies, which offers considerable flexibility. We show how this flexibility can be exploited to manipulate evaluation outcomes by framing it as an adversarial attack, where minor hyperparameter adjustments lead to significant changes in results. Our experiments demonstrate substantial variations in evaluation outcomes across multiple datasets, explanation methods, and models. To counteract this, we propose a ranking-based mitigation strategy that enhances robustness against such manipulations. This work underscores the challenges of reliable evaluation in IAI. Code is available at https://github.com/Wickstrom/quantitative-IAI-manipulation.
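The ranking-based mitigation can be illustrated with a minimal sketch (a hedged illustration only; the exact procedure is described in the paper and repository, and the array names and random scores below are purely hypothetical): rather than aggregating raw evaluation scores, explanation methods are ranked within each hyperparameter setting and the ranks are averaged, which dampens the influence of any single adversarially chosen setting.

```python
import numpy as np

# Hypothetical example: scores[i, j] is the evaluation score of explanation
# method j under hyperparameter setting i. Raw scores can swing with the
# chosen hyperparameters, so instead of averaging scores across settings we
# average the per-setting ranks of the methods (rank-based aggregation).
rng = np.random.default_rng(0)
n_settings, n_methods = 5, 3
scores = rng.random((n_settings, n_methods))

# Naive aggregation: mean raw score per method (sensitive to the scale and
# outliers induced by individual hyperparameter settings).
mean_scores = scores.mean(axis=0)

# Rank-based aggregation: rank methods within each setting
# (0 = worst, n_methods - 1 = best), then average ranks across settings.
ranks = scores.argsort(axis=1).argsort(axis=1)
mean_ranks = ranks.mean(axis=0)

print("mean raw scores:", np.round(mean_scores, 3))
print("mean ranks     :", np.round(mean_ranks, 3))
```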

Cite

Text

Wickstrøm et al. "The Price of Freedom: An Adversarial Attack on Interpretability Evaluation." NeurIPS 2024 Workshops: InterpretableAI, 2024.

Markdown

[Wickstrøm et al. "The Price of Freedom: An Adversarial Attack on Interpretability Evaluation." NeurIPS 2024 Workshops: InterpretableAI, 2024.](https://mlanthology.org/neuripsw/2024/wickstrm2024neuripsw-price/)

BibTeX

@inproceedings{wickstrm2024neuripsw-price,
  title     = {{The Price of Freedom: An Adversarial Attack on Interpretability Evaluation}},
  author    = {Wickstrøm, Kristoffer Knutsen and Höhne, Marina MC and Hedström, Anna},
  booktitle = {NeurIPS 2024 Workshops: InterpretableAI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/wickstrm2024neuripsw-price/}
}