Correct Explanations and How to Define Them: Properties and Metrics for Measuring Correctness of Three Forms of ML Model Input/Output Behaviour Explanations

Abstract

In explainable AI, many explanation methods generate similar yet diverging explanations for machine learning (ML) models. How fair is it then to explain ML model behaviour by such explanations? Arguably, one needs to judge whether those explanations are good at explaining ML model input/output behaviour. Here we attempt to formalise ways to judge the goodness of such explanations in terms of their correctness. For assessing correctness, one needs to have desirable properties of explanation correctness in mind, as well as ways to measure satisfaction of those properties. We submit two high-level properties, soundness and completeness, for assessing explanation correctness: explaining is sound if the model behaves the way the explanations say; explaining is complete if explanations can be given for the model’s outputs on any inputs. We formulate soundness and completeness properties for three forms of explanations: feature importance, counterfactuals and rules. We further formalise multiple general metrics, at least one for each property and form of explanation, for quantitatively measuring satisfaction of soundness and completeness. We argue that explanations are correct inasmuch as various aspects of the different forms of explanations are met, as quantified by those metrics. We hope that being able to assess correctness of ML model input/output behaviour explanations against formal properties and metrics is a substantial step towards fairly explaining ML-based inference.
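As an informal illustration of the two properties for rule explanations, the sketch below estimates them on a finite sample of inputs. It is not one of the paper's metrics: the function names `soundness` and `completeness`, the toy model, the toy rule set and the sample inputs are all hypothetical, and the two ratios are only one plausible reading of the abstract's wording.

```python
from typing import Callable, Optional, Sequence, Tuple

Input = Tuple[float, float]
Model = Callable[[Input], int]
# Hypothetical rule explainer: returns the output its rule asserts for x,
# or None when no rule applies to x.
RuleExplainer = Callable[[Input], Optional[int]]


def completeness(explain: RuleExplainer, inputs: Sequence[Input]) -> float:
    """Fraction of sampled inputs for which some explanation is given."""
    if not inputs:
        return 0.0
    covered = sum(1 for x in inputs if explain(x) is not None)
    return covered / len(inputs)


def soundness(model: Model, explain: RuleExplainer, inputs: Sequence[Input]) -> float:
    """Among explained inputs, fraction where the model outputs what the rule says."""
    claims = [(x, explain(x)) for x in inputs]
    explained = [(x, y) for x, y in claims if y is not None]
    if not explained:
        return 0.0
    agree = sum(1 for x, y in explained if model(x) == y)
    return agree / len(explained)


# Toy model (assumption): predicts 1 when the two features sum to more than 1.
def toy_model(x: Input) -> int:
    return 1 if x[0] + x[1] > 1.0 else 0


# Toy rule set (assumption): looks only at the first feature.
def toy_rules(x: Input) -> Optional[int]:
    if x[0] > 1.0:
        return 1
    if x[0] < 0.2:
        return 0
    return None


samples = [(1.5, 0.0), (0.1, 0.3), (0.1, 2.0), (0.5, 0.5)]
print(completeness(toy_rules, samples))          # 3 of 4 inputs are covered
print(soundness(toy_model, toy_rules, samples))  # 2 of 3 covered inputs agree
```

In this toy setting the rules leave one input unexplained, which lowers completeness, and mispredict one covered input, which lowers soundness, so the two properties can fail independently.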

Cite

Text

Singh et al. "Correct Explanations and How to Define Them: Properties and Metrics for Measuring Correctness of Three Forms of ML Model Input/Output Behaviour Explanations." Journal of Artificial Intelligence Research, 2025. doi:10.1613/JAIR.1.18691

Markdown

[Singh et al. "Correct Explanations and How to Define Them: Properties and Metrics for Measuring Correctness of Three Forms of ML Model Input/Output Behaviour Explanations." Journal of Artificial Intelligence Research, 2025.](https://mlanthology.org/jair/2025/singh2025jair-correct/) doi:10.1613/JAIR.1.18691

BibTeX

@article{singh2025jair-correct,
  title     = {{Correct Explanations and How to Define Them: Properties and Metrics for Measuring Correctness of Three Forms of ML Model Input/Output Behaviour Explanations}},
  author    = {Singh, Vandita and Cyras, Kristijonas and Akram, Muhammad Zain and Inam, Rafia},
  journal   = {Journal of Artificial Intelligence Research},
  year      = {2025},
  doi       = {10.1613/JAIR.1.18691},
  volume    = {84},
  url       = {https://mlanthology.org/jair/2025/singh2025jair-correct/}
}