Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions

Abstract

Interpretability researchers face a universal question: without access to ground-truth labels, how can the faithfulness of an explanation to its model be determined? Despite immense efforts to develop new evaluation methods, current approaches remain in a pre-paradigmatic state: fragmented, difficult to calibrate, and lacking cohesive theoretical grounding. Observing the lack of a unifying theory, we propose a novel evaluative criterion termed Generalised Explanation Faithfulness (GEF), which is centered on explanation-to-model alignment and integrates existing perturbation-based evaluations, eliminating the need for singular, task-specific evaluations. Complementing this unifying perspective, we reveal, from a geometric point of view, a prevalent yet critical oversight in current evaluation practice: the failure to account for the learned geometry and non-linear mappings of the model and explanation spaces. To address this, we propose GEF, a general-purpose, threshold-free faithfulness evaluator that incorporates principles from differential geometry and facilitates evaluation agnostically across tasks and interpretability approaches. Through extensive cross-domain benchmarks on natural language processing, vision, and tabular tasks, we provide first-of-its-kind insights into the comparative performance of various interpretable methods, including local linear approximators, global feature visualisation methods, large language models as post-hoc explainers, and sparse autoencoders. Our contributions are important to the interpretability and AI safety communities, offering a principled, unified approach for evaluation.
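
To make the perturbation-based evaluation family the abstract refers to concrete, below is a minimal Python sketch of one common instance: perturb an input and correlate the model's actual output change with the change predicted by a linear read-out of the attributions. This is an illustrative stand-in under stated assumptions, not the authors' GEF estimator; the function name, signature, and toy model are hypothetical.

# Minimal sketch of a generic perturbation-based faithfulness check.
# Illustrates the evaluation family the abstract generalises; it is
# NOT the authors' GEF method. All names here are hypothetical.
import numpy as np

def perturbation_faithfulness(model, explanation, x,
                              n_perturbations=50, noise_scale=0.1, seed=0):
    """Correlate actual model-output changes with changes predicted
    by the explanation under small random input perturbations.

    model:       callable mapping a 1-D input array to a scalar score
    explanation: per-feature attribution vector for input x
    x:           1-D input array
    """
    rng = np.random.default_rng(seed)
    base = model(x)
    deltas_model, deltas_expl = [], []
    for _ in range(n_perturbations):
        noise = rng.normal(scale=noise_scale, size=x.shape)
        # Actual change in the model's output under the perturbation.
        deltas_model.append(model(x + noise) - base)
        # Change predicted by a linear read-out of the attributions.
        deltas_expl.append(float(explanation @ noise))
    # Higher correlation -> the explanation tracks local model behaviour.
    return float(np.corrcoef(deltas_model, deltas_expl)[0, 1])

# Toy usage: for a linear model, its weight vector is a perfectly
# faithful attribution, so the score should be ~1.0.
w = np.array([0.5, -2.0, 1.0])
model = lambda v: float(w @ v)
x = np.array([1.0, 0.0, -1.0])
print(perturbation_faithfulness(model, w, x))

Such correlation-style checks implicitly treat both spaces as flat and Euclidean; the abstract's geometric point is precisely that this assumption ignores the learned, non-linear structure of the model and explanation spaces.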

Cite

Text

Hedström et al. "Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions." Transactions on Machine Learning Research, 2025.

Markdown

[Hedström et al. "Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/hedstrom2025tmlr-evaluating/)

BibTeX

@article{hedstrom2025tmlr-evaluating,
  title     = {{Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions}},
  author    = {Hedström, Anna and Bommer, Philine Lou and Burns, Thomas F and Lapuschkin, Sebastian and Samek, Wojciech and Höhne, Marina MC},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/hedstrom2025tmlr-evaluating/}
}